1 Motivation and purpose

The study attempts to predict the house price in Ames, Iowa. To predict the house price, several machine learning methods will be used such as linear model (LM), Generalized Linear Model (GLM), Generalized Additive Model (GAM), Neural Network (NN), Support Vector Machine (SVM) and Optimization. Yet, before predicting price and applying models, it’s vital to explore and understand the data.

2 Data comprehension

There are 82 variables and 2930 observations in the data set. The description of each feature can be found under Variables description file. Because of many variables, we first examine and make a intuition analysis on every variable and its relevance to SalePrice for LM, GAM and NN. Although this process can be time-consuming, it sheds the light to grasp the data better. How did we do that? Below are the step explanation:

  • Reading and understanding every variable description
  • Noting this information: variable name and importance to sale price (Yes/No). The decision of Yes/No is grounded on intuition, expectation and some house-price related articles. To be more specific, we question these following questions: which factor matters when we buy a house? How important would this factor be? Does this factor appear in other variables?

At the first glimpse, we marked these might-be-relevant features to the SalePrice: MS.Zoning, Lot.Area, Street, Lot.Shape, Neighborhood, Bldg.Type, House.Style, Overall.Qual, Year.Built, Foundation, Total.Bsmt.SF, TotRms.AbvGrd, Mo.Sold, Yr.Sold, Garage.Cars, Sale.Condition. After considering of potential features, let’s inspect the response variable in the next part.

2.1 The response variable - Sale price

One of the primary purposes of this analysis is to predict the housing sales price based on appropriate features. First, we plot the histogram and box plot of the sale price (i.e. the dependent variable).

# Histogram
histogram_sale_price <- ggplot(d_house, mapping = aes(x = SalePrice)) +
  geom_histogram(aes(y =..density..), color="black", fill = "grey") +
  geom_density(alpha =.2, color ="blue") +
  scale_x_continuous(labels = comma) +
  labs(title = "Histogram of sale price") +
  my_theme_general

# Boxplot
boxplot_sale_price <- ggplot(d_house, mapping = aes(x = "", y = SalePrice)) +
  geom_boxplot(color = "black", fill = "grey") +
  scale_y_continuous(labels = comma) +
  coord_flip() +
   xlab("") +
  my_theme_general

# Display two plots together
egg::ggarrange(histogram_sale_price, boxplot_sale_price, heights = 2:1)

# Descriptive statistics summary of SalePrice
stats_SalePrice <- psych::describe(d_house$SalePrice)

The distribution of sale prices displays right-skewed with a long tail. The skew of 1.7417153 confirms that the data are highly confirmed. Besides, the Kurtosis is 5.1025881, i.e. Leptokurtic with a longer distribution and fatter tails. Some houses are very expensive. We have to carefully consider whether these expensive houses are outliers. Instead of dropping outliers, we apply the log transformation for sale prices.

From the beginning to this point, we just follow the intuition for variable selection. However, this method can be too subjective. Hence, a more objective and systematic way need to be executed as the section below. Nevertheless, we still consider our initial thoughts as a double-check criteria and critical thinking in supplement with the multivariate study.

3 Numerical variable analysis

3.1 Correlation matrix

# Overview correlation matrix of all numerical variables
d_house_numeric <- d_house %>% 
  select(where(is.numeric))

# Missing value check in numerical features -> not so many NA
round(colMeans(is.na(d_house_numeric)), 2) %>%  
  as.data.frame()
##                    .
## MS.SubClass     0.00
## Lot.Frontage    0.17
## Lot.Area        0.00
## Overall.Qual    0.00
## Overall.Cond    0.00
## Year.Built      0.00
## Year.Remod.Add  0.00
## Mas.Vnr.Area    0.01
## BsmtFin.SF.1    0.00
## BsmtFin.SF.2    0.00
## Bsmt.Unf.SF     0.00
## Total.Bsmt.SF   0.00
## X1st.Flr.SF     0.00
## X2nd.Flr.SF     0.00
## Low.Qual.Fin.SF 0.00
## Gr.Liv.Area     0.00
## Bsmt.Full.Bath  0.00
## Bsmt.Half.Bath  0.00
## Full.Bath       0.00
## Half.Bath       0.00
## Bedroom.AbvGr   0.00
## Kitchen.AbvGr   0.00
## TotRms.AbvGrd   0.00
## Fireplaces      0.00
## Garage.Yr.Blt   0.05
## Garage.Cars     0.00
## Garage.Area     0.00
## Wood.Deck.SF    0.00
## Open.Porch.SF   0.00
## Enclosed.Porch  0.00
## X3Ssn.Porch     0.00
## Screen.Porch    0.00
## Pool.Area       0.00
## Misc.Val        0.00
## Mo.Sold         0.00
## Yr.Sold         0.00
## SalePrice       0.00
# Overview histograms of all numerical variables
# hist.data.frame(d_house_numeric)

# Create correlation matrix, use = complete to obmit NA, because not so many NA
m_cor <- cor(d_house_numeric, use = "complete") 

# Correlogram
corrplot(m_cor, method = "color", addCoef.col="black",
         order = "alphabet", number.cex=0.35, tl.cex = 0.55)

The correlation matrix shows some dark blue areas where variables are highly correlated to each other. To be more specific, Garage.Area is highly correlated Garage.Cars or other pairs such Gr.Liv.Area with TotRms.AbvGrd; Garage.Yr.Blt with Year.Built. These pairs indicate the strong correlation and multi-collinearity. When we check these variables’ meaning, they give almost similar information. Here, the correlation matrix is a crucial method to detect the multi-collinear problems in the feature selection process.

# Calculate correlation matrix for all numerical features against SalePrice
m_cor_SalePrice <- round(
  cor(d_house_numeric, d_house["SalePrice"], use = "complete", ) ,2)

# Transform into tibble and sort correlation in a descending order
d_cor_SalePrice <- tibble(var_name = row.names(m_cor_SalePrice), 
       SalePrice = m_cor_SalePrice[, "SalePrice"]) %>% 
  arrange(desc(SalePrice))
d_cor_SalePrice 
## # A tibble: 37 x 2
##    var_name       SalePrice
##    <chr>              <dbl>
##  1 SalePrice           1   
##  2 Overall.Qual        0.8 
##  3 Gr.Liv.Area         0.71
##  4 Garage.Cars         0.66
##  5 Total.Bsmt.SF       0.65
##  6 Garage.Area         0.65
##  7 X1st.Flr.SF         0.64
##  8 Year.Built          0.56
##  9 Full.Bath           0.56
## 10 Year.Remod.Add      0.54
## # ... with 27 more rows

Remarkably, Lot.Area, the lot size in square feet is one of the variable we considered as important at the first sight but its correlation to the SalePrice is only 0.31. Let’s look at the scatter plot of these two variables:

# Scatter plot SalePrice & Lot.Area
scatter_price_lot <- ggplot(d_house, aes(x = Lot.Area, y = SalePrice)) + 
  geom_point() 

# Scatter plot SalePrice & Lot.Area but change the x & y range
scatter_price_lot2 <- ggplot(d_house, aes(x = Lot.Area, y = SalePrice)) + 
  geom_point() +
  xlim(0, 50000) +
  ylim(0, 500000)

# Plot the two side-by-side
egg::ggarrange(scatter_price_lot, scatter_price_lot2, ncol=2, nrow = 1)

The two plots show the same information but the right scatter displays the removal of outliers. The Lot.Area data points stay mostly at the smaller value and their price varies vertically for the same lot space.

3.2 Top variables correlated to SalePrice

The top numerical features with the absolute highest correlation with the SalePrice are now selected and we produce a SalePrice correlation matrix as below:

# Make df of top 10 numerical variables with the absolute highest correlation to SalePrice
d_cor_top_10 <- d_cor_SalePrice %>%  
  mutate(abs_corr = abs(SalePrice)) %>% 
  arrange(desc(abs_corr)) %>% 
  slice(1:10)

# Make vector with variable name
v_top_10_cor <- d_cor_top_10 %>% 
  # Select the variable names and convert to vector 
  pull(var_name)

# Correlation matrix of top 10
m_cor_top_10 <- cor(d_house_numeric %>% 
               select(all_of(v_top_10_cor)), 
             use = "complete") 

# Correlogram
corrplot(m_cor_top_10, method = "color", addCoef.col="black", type = 'lower',
         order = "alphabet", number.cex=0.55, tl.cex = 0.55)

# Quick look on the scatter plot in pair for the top 10
ggpairs(d_house_numeric %>% select(all_of(v_top_10_cor)))

From the new correlation matrix, some important findings are found:

  • Garage.Area and Garage.Cars are correlated highly with the dependent variable but they have the same information. Here, we keep the Garage.Cars as its higher correlation.
  • SalePrice is strongly correlated to Gr.Liv.Area, Overall.Qual, Total.Bsmt.SF
  • Full.Bath means full bathrooms above grade/ground. It seems to be strange here
  • BsmtFin.SF.1 and Total.Bsmt.SF display the related information, we choose Total.Bsmt.SF in this case
  • TotRms.AbvGrd and Gr.Liv.Area are similar. Here, we choose Gr.Liv.Area
  • Year.Built: 0.55 - moderately correlated to SalePrice

After choosing the variables, let’s make scatter plots to check the relationship between SalePrice and other variables

# Selected numerical variables
d_select_num <- d_house_numeric %>% 
  select(SalePrice, Garage.Cars, Gr.Liv.Area, Overall.Qual, Total.Bsmt.SF, Full.Bath, Year.Built)

# list of selected variables 
l_vars <- c("Garage.Cars",  "Gr.Liv.Area", "Overall.Qual", "Total.Bsmt.SF", "Full.Bath",  "Year.Built")

# empty list for scatter plot
plotlist <- list()

# For loop to get several scatter plots at once with method = lm
for (idx in 1:length(l_vars)) {
  this_var <- l_vars[idx]
  p <- ggplot(d_select_num, aes_string(x = this_var, y = "SalePrice")) +
    geom_point() +
    geom_smooth(method = lm)
  plotlist[[idx]] <- p
} 

# 6 scatter plots at once
egg::ggarrange(plotlist[[1]], plotlist[[2]], 
               plotlist[[3]], plotlist[[4]],
               plotlist[[5]], plotlist[[6]], ncol=3, nrow = 2)

In general, some plots indicate that all of these variables have a linear positive relationship with the dependent variable SalePrice.

4 Missing values

It’s important to inspect the missing values in each variable. How is the ratio of missing value in each feature? What does the missing value mean in each case? Are missing values random? These questions are vital when we deal with the missing value. Removing them immediately without reflecting can cause data size reduction and bias.

# Number of NA values in each column
d_number_missing_values <- colSums(is.na(d_house)) %>%  
  as.data.frame() 
colnames(d_number_missing_values) <- c("number")

# Percentage of NA values in each column
d_perc_missing_values <- round(colMeans(is.na(d_house)), 2) %>%  
  as.data.frame()
colnames(d_perc_missing_values) <- c("ratio")
d_perc_missing_values
##                 ratio
## MS.SubClass      0.00
## MS.Zoning        0.00
## Lot.Frontage     0.17
## Lot.Area         0.00
## Street           0.00
## Alley            0.93
## Lot.Shape        0.00
## Land.Contour     0.00
## Utilities        0.00
## Lot.Config       0.00
## Land.Slope       0.00
## Neighborhood     0.00
## Condition.1      0.00
## Condition.2      0.00
## Bldg.Type        0.00
## House.Style      0.00
## Overall.Qual     0.00
## Overall.Cond     0.00
## Year.Built       0.00
## Year.Remod.Add   0.00
## Roof.Style       0.00
## Roof.Matl        0.00
## Exterior.1st     0.00
## Exterior.2nd     0.00
## Mas.Vnr.Type     0.00
## Mas.Vnr.Area     0.01
## Exter.Qual       0.00
## Exter.Cond       0.00
## Foundation       0.00
## Bsmt.Qual        0.03
## Bsmt.Cond        0.03
## Bsmt.Exposure    0.03
## BsmtFin.Type.1   0.03
## BsmtFin.SF.1     0.00
## BsmtFin.Type.2   0.03
## BsmtFin.SF.2     0.00
## Bsmt.Unf.SF      0.00
## Total.Bsmt.SF    0.00
## Heating          0.00
## Heating.QC       0.00
## Central.Air      0.00
## Electrical       0.00
## X1st.Flr.SF      0.00
## X2nd.Flr.SF      0.00
## Low.Qual.Fin.SF  0.00
## Gr.Liv.Area      0.00
## Bsmt.Full.Bath   0.00
## Bsmt.Half.Bath   0.00
## Full.Bath        0.00
## Half.Bath        0.00
## Bedroom.AbvGr    0.00
## Kitchen.AbvGr    0.00
## Kitchen.Qual     0.00
## TotRms.AbvGrd    0.00
## Functional       0.00
## Fireplaces       0.00
## Fireplace.Qu     0.49
## Garage.Type      0.05
## Garage.Yr.Blt    0.05
## Garage.Finish    0.05
## Garage.Cars      0.00
## Garage.Area      0.00
## Garage.Qual      0.05
## Garage.Cond      0.05
## Paved.Drive      0.00
## Wood.Deck.SF     0.00
## Open.Porch.SF    0.00
## Enclosed.Porch   0.00
## X3Ssn.Porch      0.00
## Screen.Porch     0.00
## Pool.Area        0.00
## Pool.QC          1.00
## Fence            0.80
## Misc.Feature     0.96
## Misc.Val         0.00
## Mo.Sold          0.00
## Yr.Sold          0.00
## Sale.Type        0.00
## Sale.Condition   0.00
## SalePrice        0.00
# A summary table of missing values in each column
d_summary_missing_value <- cbind(d_number_missing_values, d_perc_missing_values)

# Change the summary missing values to tibble
d_summary_missing_value <- tibble(var_name = row.names(d_summary_missing_value), 
       n_missing = d_summary_missing_value[, "number"],
       perc_missing = d_summary_missing_value[, "ratio"]) %>% 
  arrange(desc(perc_missing)) 
d_summary_missing_value 
## # A tibble: 80 x 3
##    var_name      n_missing perc_missing
##    <chr>             <dbl>        <dbl>
##  1 Pool.QC            2917         1   
##  2 Misc.Feature       2824         0.96
##  3 Alley              2732         0.93
##  4 Fence              2358         0.8 
##  5 Fireplace.Qu       1422         0.49
##  6 Lot.Frontage        490         0.17
##  7 Garage.Type         157         0.05
##  8 Garage.Yr.Blt       159         0.05
##  9 Garage.Finish       157         0.05
## 10 Garage.Qual         158         0.05
## # ... with 70 more rows
# Exclude the columns with 49% missing value
d_house <- d_house %>% 
  select(-c(Alley, Fireplace.Qu, Pool.QC, Fence, Misc.Feature))

# Remove the 1 NA rows from two important columns
d_house <- d_house %>% drop_na(Garage.Cars, Total.Bsmt.SF)

After examine the missing ratio, here are some decisions:

  • Excluding all the columns with more than 49% of missing values, namely Pool.QC, Misc.Feature, Alley, Fence and Fireplace.Qu.Filling NA value is not applied because these variables are not so important and not the main factors to think about when we buy a house.
  • Group of variables with “Garage”: 5% are missing in these columns. Since we chose only Garage.Cars for the model in the previous part, we do not care about the rest.
  • Group of variables with “Bsmt”: similar principle as the group of “Garage”

5 Categorical variables

In this section, we check the effects of different categorical features that are potentially relevant for the model. Before applying function drop1(), we visualize SalePrice with several categorical variables

5.1 Visualizations of categorical variables

# Make a function to draw all box plots
fill <- "#FFFFFF"
line <- "#000000"

boxplot_theme <- function(variable_input, title, variable) {
  output <- ggplot(d_house, aes(x = variable_input, y = SalePrice)) +
    geom_boxplot(fill = fill, colour = line) +
    ggtitle(title) +
    theme_minimal() +
    scale_y_continuous(labels = comma) +
    labs(x = paste0("\n", variable), y = "Price\n") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),
          legend.position = "bottom", legend.direction = "horizontal",
          legend.box = "horizontal",
          legend.key.size = unit(1, "cm"),
          plot.title = element_text(family="Tahoma", hjust = 0.5),
          text = element_text(family = "Tahoma"),
          axis.title = element_text(size = 10),
          legend.text = element_text(size = 9),
          legend.title=element_text(face = "bold"))

  return(output)
}  

5.1.1 MS.Zoning

SalePrice seems to vary among different MSZoning. Here, MSZoning identifies the general zoning classification of the sales such as agriculture, commercial, industrial. Especially Residential with low density (RL) and residential with medium density (RM) are strongly right skewed.

# Boxplot price in ms zonings
title <- "General Zoning Classification"
variable <- "MS Zoning"
boxplot_theme(d_house$MS.Zoning, title, variable)

5.1.2 Lot Shape

The general shapes of the property appear to be similar. With the exception that properties with a regular shape has a lower value compared to an irregular shape.

# Boxplot price in different lot shapes
title <- "General shape of property"
variable <- "Lot Shape"
boxplot_theme(d_house$Lot.Shape, title, variable)

5.1.3 Neighborhood

We identified especially in the neighborhoods strong variances regarding the prices. Especially Northridge (NoRidge) has a high price per property and shows some outliers in the direction of an higher price.

# Boxplot price in different neighborhoods
title <- "Physical locations within Ames city limits"
variable <- "Neighborhood"
boxplot_theme(d_house$Neighborhood, title, variable)

5.1.4 Bldg.Type

The type of dwelling or house shows no radical differences for each category’s median price. However, Single-family Detached (1Fam) has the highest potential and many tails to the higher prices.

# Boxplot price in different building types
title <- "Type of building"
variable <- "Building Type"
boxplot_theme(d_house$Bldg.Type, title, variable)

5.1.5 House.Style

The same for House Style. The sales price for one story or two story is not strong impacted by the house style.

# Boxplot price in different house styles
title <- "Style of dwelling"
variable <- "HouseStyle"
boxplot_theme(d_house$House.Style, title, variable)

5.1.6 Foundation

The Foundation variable consists of Brick & Tile, Cinder Block, Poured Contrete, Slab, Stone, Wood. The price seems to differ in different type of foundation but again, this can be a factor that was influenced by the Year.Built.

# Boxplot price in different house styles
title <- "Foundation"
variable <- "Foundation"
boxplot_theme(d_house$Foundation, title, variable)

5.1.7 Functional

The Functional variable comprises different categories such as major deduction 1, 2, minor deduction 1,2 etc. The price seems to differ in different type of functions, yet, it can be related to the overall quality variables.

# Boxplot price in different house styles
title <- "Functional"
variable <- "Functional"
boxplot_theme(d_house$Functional, title, variable)

5.1.8 Sale conditions

This variable indicate the condition of sale such as normal sale, abnormal sale (trade, short sale), sale between family members, adjoing land purchases etc. The median of the sales price is the highest among the Sales condition which was not completed when last assessed (associated with new homes).

title <- "Condition of sale"
variable <- "Sale Condition"
boxplot_theme(d_house$Sale.Condition, title, variable)

5.2 Testing the effect of categorical variables

After several visualizations and descriptive analysis, let’s fit the model with all highly potential categorical variables and then apply drop1() function to test their significance.

# Select all categorical might be relevant
d_house_selected_categorical <- d_house %>% 
  select(MS.Zoning, Neighborhood, Street, House.Style, Sale.Condition)

# Fit model with only these categorical variables
lm_cate_1 <- lm(log(SalePrice) ~ MS.Zoning + Neighborhood + Street + Foundation + 
                Bldg.Type + House.Style + Sale.Condition + Functional, 
                data = d_house)

# Check their significance effects
drop1(lm_cate_1, test = "F")
## Single term deletions
## 
## Model:
## log(SalePrice) ~ MS.Zoning + Neighborhood + Street + Foundation + 
##     Bldg.Type + House.Style + Sale.Condition + Functional
##                Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                      156.59 -8448.4                      
## MS.Zoning       6     4.432 161.03 -8378.7 13.5131 3.559e-15 ***
## Neighborhood   27    96.005 252.60 -7102.4 65.0546 < 2.2e-16 ***
## Street          1     0.201 156.79 -8446.7  3.6748   0.05534 .  
## Foundation      5    10.577 167.17 -8267.0 38.7046 < 2.2e-16 ***
## Bldg.Type       4     7.069 163.66 -8327.1 32.3327 < 2.2e-16 ***
## House.Style     7     8.655 165.25 -8304.9 22.6223 < 2.2e-16 ***
## Sale.Condition  5     4.875 161.47 -8368.6 17.8393 < 2.2e-16 ***
## Functional      7     2.208 158.80 -8421.4  5.7708 1.179e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the p-value obtained, Street does not seem to have a relevant effect on the response variable.

5.3 Predictors of focus

As we have seen so far, the linear model describes relationship between the response variable \(Y\) and the predictors \(X_1,X_2,...X_n\). However, not all explanatory variables play an important role in predicting the random variable. Now, the question is if omitting a predictor would affect the degree to which the model fits the data, substantially deteriorated or not? As we will not fit the model with 74 predictors, it’s vital to choose some focus variables, which are more potential and relevant for our model.

5.3.1 Stepwise selection

Command regsubsets from the library leaps runs the whole procedure automatically. Wherever TRUE is written, the corresponding explanatory variable is contained in model.

# Choose all variables exclude some variables starting with "Bsmt" because cannot run stepwise selection
d_exclude <- d_house %>% 
  select(!starts_with("Bsmt"))

# Fit all variables
lm_full <- lm(log(SalePrice) ~ ., data = d_exclude)

# Forward selection
reg_model <- regsubsets(log(SalePrice) ~ ., data = d_exclude, 
                        na.action=na.exclude, method = "forward", nvmax = 11)
## Reordering variables and trying again:
reg_sum <- summary(reg_model)

reg_sum$which
##    (Intercept) MS.SubClass MS.ZoningC (all) MS.ZoningFV MS.ZoningI (all)
## 1         TRUE       FALSE            FALSE       FALSE            FALSE
## 2         TRUE       FALSE            FALSE       FALSE            FALSE
## 3         TRUE       FALSE            FALSE       FALSE            FALSE
## 4         TRUE       FALSE            FALSE       FALSE            FALSE
## 5         TRUE       FALSE            FALSE       FALSE            FALSE
## 6         TRUE       FALSE            FALSE       FALSE            FALSE
## 7         TRUE       FALSE            FALSE       FALSE            FALSE
## 8         TRUE       FALSE            FALSE       FALSE            FALSE
## 9         TRUE       FALSE            FALSE       FALSE            FALSE
## 10        TRUE       FALSE            FALSE       FALSE            FALSE
## 11        TRUE       FALSE            FALSE       FALSE            FALSE
## 12        TRUE       FALSE            FALSE       FALSE            FALSE
##    MS.ZoningRH MS.ZoningRL MS.ZoningRM Lot.Frontage Lot.Area StreetPave
## 1        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 2        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 3        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 4        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 5        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 6        FALSE       FALSE       FALSE        FALSE    FALSE      FALSE
## 7        FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
## 8        FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
## 9        FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
## 10       FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
## 11       FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
## 12       FALSE        TRUE       FALSE        FALSE    FALSE      FALSE
##    Lot.ShapeIR2 Lot.ShapeIR3 Lot.ShapeReg Land.ContourHLS Land.ContourLow
## 1         FALSE        FALSE        FALSE           FALSE           FALSE
## 2         FALSE        FALSE        FALSE           FALSE           FALSE
## 3         FALSE        FALSE        FALSE           FALSE           FALSE
## 4         FALSE        FALSE        FALSE           FALSE           FALSE
## 5         FALSE        FALSE        FALSE           FALSE           FALSE
## 6         FALSE        FALSE        FALSE           FALSE           FALSE
## 7         FALSE        FALSE        FALSE           FALSE           FALSE
## 8         FALSE        FALSE        FALSE           FALSE           FALSE
## 9         FALSE        FALSE        FALSE           FALSE           FALSE
## 10        FALSE        FALSE        FALSE           FALSE           FALSE
## 11        FALSE        FALSE        FALSE           FALSE           FALSE
## 12        FALSE        FALSE        FALSE           FALSE           FALSE
##    Land.ContourLvl UtilitiesNoSeWa UtilitiesNoSewr Lot.ConfigCulDSac
## 1            FALSE           FALSE           FALSE             FALSE
## 2            FALSE           FALSE           FALSE             FALSE
## 3            FALSE           FALSE           FALSE             FALSE
## 4            FALSE           FALSE           FALSE             FALSE
## 5            FALSE           FALSE           FALSE             FALSE
## 6            FALSE           FALSE           FALSE             FALSE
## 7            FALSE           FALSE           FALSE             FALSE
## 8            FALSE           FALSE           FALSE             FALSE
## 9            FALSE           FALSE           FALSE             FALSE
## 10           FALSE           FALSE           FALSE             FALSE
## 11           FALSE           FALSE           FALSE             FALSE
## 12           FALSE           FALSE           FALSE             FALSE
##    Lot.ConfigFR2 Lot.ConfigFR3 Lot.ConfigInside Land.SlopeMod Land.SlopeSev
## 1          FALSE         FALSE            FALSE         FALSE         FALSE
## 2          FALSE         FALSE            FALSE         FALSE         FALSE
## 3          FALSE         FALSE            FALSE         FALSE         FALSE
## 4          FALSE         FALSE            FALSE         FALSE         FALSE
## 5          FALSE         FALSE            FALSE         FALSE         FALSE
## 6          FALSE         FALSE            FALSE         FALSE         FALSE
## 7          FALSE         FALSE            FALSE         FALSE         FALSE
## 8          FALSE         FALSE            FALSE         FALSE         FALSE
## 9          FALSE         FALSE            FALSE         FALSE         FALSE
## 10         FALSE         FALSE            FALSE         FALSE         FALSE
## 11         FALSE         FALSE            FALSE         FALSE         FALSE
## 12         FALSE         FALSE            FALSE         FALSE         FALSE
##    NeighborhoodBlueste NeighborhoodBrDale NeighborhoodBrkSide
## 1                FALSE              FALSE               FALSE
## 2                FALSE              FALSE               FALSE
## 3                FALSE              FALSE               FALSE
## 4                FALSE              FALSE               FALSE
## 5                FALSE              FALSE               FALSE
## 6                FALSE              FALSE               FALSE
## 7                FALSE              FALSE               FALSE
## 8                FALSE              FALSE               FALSE
## 9                FALSE              FALSE               FALSE
## 10               FALSE              FALSE               FALSE
## 11               FALSE              FALSE               FALSE
## 12               FALSE              FALSE               FALSE
##    NeighborhoodClearCr NeighborhoodCollgCr NeighborhoodCrawfor
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE                TRUE
## 9                FALSE               FALSE                TRUE
## 10               FALSE               FALSE                TRUE
## 11               FALSE               FALSE                TRUE
## 12               FALSE               FALSE                TRUE
##    NeighborhoodEdwards NeighborhoodGilbert NeighborhoodGreens
## 1                FALSE               FALSE              FALSE
## 2                FALSE               FALSE              FALSE
## 3                FALSE               FALSE              FALSE
## 4                FALSE               FALSE              FALSE
## 5                FALSE               FALSE              FALSE
## 6                FALSE               FALSE              FALSE
## 7                FALSE               FALSE              FALSE
## 8                FALSE               FALSE              FALSE
## 9                FALSE               FALSE              FALSE
## 10               FALSE               FALSE              FALSE
## 11               FALSE               FALSE              FALSE
## 12               FALSE               FALSE              FALSE
##    NeighborhoodGrnHill NeighborhoodIDOTRR NeighborhoodLandmrk
## 1                FALSE              FALSE               FALSE
## 2                FALSE              FALSE               FALSE
## 3                FALSE              FALSE               FALSE
## 4                FALSE              FALSE               FALSE
## 5                FALSE              FALSE               FALSE
## 6                FALSE              FALSE               FALSE
## 7                FALSE              FALSE               FALSE
## 8                FALSE              FALSE               FALSE
## 9                FALSE              FALSE               FALSE
## 10               FALSE              FALSE               FALSE
## 11               FALSE              FALSE               FALSE
## 12               FALSE              FALSE               FALSE
##    NeighborhoodMeadowV NeighborhoodMitchel NeighborhoodNAmes
## 1                FALSE               FALSE             FALSE
## 2                FALSE               FALSE             FALSE
## 3                FALSE               FALSE             FALSE
## 4                FALSE               FALSE             FALSE
## 5                FALSE               FALSE             FALSE
## 6                FALSE               FALSE             FALSE
## 7                FALSE               FALSE             FALSE
## 8                FALSE               FALSE             FALSE
## 9                FALSE               FALSE             FALSE
## 10               FALSE               FALSE             FALSE
## 11               FALSE               FALSE             FALSE
## 12               FALSE               FALSE             FALSE
##    NeighborhoodNoRidge NeighborhoodNPkVill NeighborhoodNridgHt
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    NeighborhoodNWAmes NeighborhoodOldTown NeighborhoodSawyer
## 1               FALSE               FALSE              FALSE
## 2               FALSE               FALSE              FALSE
## 3               FALSE               FALSE              FALSE
## 4               FALSE               FALSE              FALSE
## 5               FALSE               FALSE              FALSE
## 6               FALSE               FALSE              FALSE
## 7               FALSE               FALSE              FALSE
## 8               FALSE               FALSE              FALSE
## 9               FALSE               FALSE              FALSE
## 10              FALSE               FALSE              FALSE
## 11              FALSE               FALSE              FALSE
## 12              FALSE               FALSE              FALSE
##    NeighborhoodSawyerW NeighborhoodSomerst NeighborhoodStoneBr
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    NeighborhoodSWISU NeighborhoodTimber NeighborhoodVeenker Condition.1Feedr
## 1              FALSE              FALSE               FALSE            FALSE
## 2              FALSE              FALSE               FALSE            FALSE
## 3              FALSE              FALSE               FALSE            FALSE
## 4              FALSE              FALSE               FALSE            FALSE
## 5              FALSE              FALSE               FALSE            FALSE
## 6              FALSE              FALSE               FALSE            FALSE
## 7              FALSE              FALSE               FALSE            FALSE
## 8              FALSE              FALSE               FALSE            FALSE
## 9              FALSE              FALSE               FALSE            FALSE
## 10             FALSE              FALSE               FALSE            FALSE
## 11             FALSE              FALSE               FALSE            FALSE
## 12             FALSE              FALSE               FALSE            FALSE
##    Condition.1Norm Condition.1PosA Condition.1PosN Condition.1RRAe
## 1            FALSE           FALSE           FALSE           FALSE
## 2            FALSE           FALSE           FALSE           FALSE
## 3            FALSE           FALSE           FALSE           FALSE
## 4            FALSE           FALSE           FALSE           FALSE
## 5            FALSE           FALSE           FALSE           FALSE
## 6            FALSE           FALSE           FALSE           FALSE
## 7            FALSE           FALSE           FALSE           FALSE
## 8            FALSE           FALSE           FALSE           FALSE
## 9            FALSE           FALSE           FALSE           FALSE
## 10           FALSE           FALSE           FALSE           FALSE
## 11           FALSE           FALSE           FALSE           FALSE
## 12           FALSE           FALSE           FALSE           FALSE
##    Condition.1RRAn Condition.1RRNe Condition.1RRNn Condition.2Feedr
## 1            FALSE           FALSE           FALSE            FALSE
## 2            FALSE           FALSE           FALSE            FALSE
## 3            FALSE           FALSE           FALSE            FALSE
## 4            FALSE           FALSE           FALSE            FALSE
## 5            FALSE           FALSE           FALSE            FALSE
## 6            FALSE           FALSE           FALSE            FALSE
## 7            FALSE           FALSE           FALSE            FALSE
## 8            FALSE           FALSE           FALSE            FALSE
## 9            FALSE           FALSE           FALSE            FALSE
## 10           FALSE           FALSE           FALSE            FALSE
## 11           FALSE           FALSE           FALSE            FALSE
## 12           FALSE           FALSE           FALSE            FALSE
##    Condition.2Norm Condition.2PosA Condition.2PosN Condition.2RRAe
## 1            FALSE           FALSE           FALSE           FALSE
## 2            FALSE           FALSE           FALSE           FALSE
## 3            FALSE           FALSE           FALSE           FALSE
## 4            FALSE           FALSE           FALSE           FALSE
## 5            FALSE           FALSE           FALSE           FALSE
## 6            FALSE           FALSE           FALSE           FALSE
## 7            FALSE           FALSE           FALSE           FALSE
## 8            FALSE           FALSE           FALSE           FALSE
## 9            FALSE           FALSE           FALSE           FALSE
## 10           FALSE           FALSE           FALSE           FALSE
## 11           FALSE           FALSE           FALSE           FALSE
## 12           FALSE           FALSE           FALSE           FALSE
##    Condition.2RRAn Condition.2RRNn Bldg.Type2fmCon Bldg.TypeDuplex
## 1            FALSE           FALSE           FALSE           FALSE
## 2            FALSE           FALSE           FALSE           FALSE
## 3            FALSE           FALSE           FALSE           FALSE
## 4            FALSE           FALSE           FALSE           FALSE
## 5            FALSE           FALSE           FALSE           FALSE
## 6            FALSE           FALSE           FALSE           FALSE
## 7            FALSE           FALSE           FALSE           FALSE
## 8            FALSE           FALSE           FALSE           FALSE
## 9            FALSE           FALSE           FALSE           FALSE
## 10           FALSE           FALSE           FALSE           FALSE
## 11           FALSE           FALSE           FALSE           FALSE
## 12           FALSE           FALSE           FALSE           FALSE
##    Bldg.TypeTwnhs Bldg.TypeTwnhsE House.Style1.5Unf House.Style1Story
## 1           FALSE           FALSE             FALSE             FALSE
## 2           FALSE           FALSE             FALSE             FALSE
## 3           FALSE           FALSE             FALSE             FALSE
## 4           FALSE           FALSE             FALSE             FALSE
## 5           FALSE           FALSE             FALSE             FALSE
## 6           FALSE           FALSE             FALSE             FALSE
## 7           FALSE           FALSE             FALSE             FALSE
## 8           FALSE           FALSE             FALSE             FALSE
## 9           FALSE           FALSE             FALSE             FALSE
## 10          FALSE           FALSE             FALSE             FALSE
## 11           TRUE           FALSE             FALSE             FALSE
## 12           TRUE           FALSE             FALSE             FALSE
##    House.Style2.5Fin House.Style2.5Unf House.Style2Story House.StyleSFoyer
## 1              FALSE             FALSE             FALSE             FALSE
## 2              FALSE             FALSE             FALSE             FALSE
## 3              FALSE             FALSE             FALSE             FALSE
## 4              FALSE             FALSE             FALSE             FALSE
## 5              FALSE             FALSE             FALSE             FALSE
## 6              FALSE             FALSE             FALSE             FALSE
## 7              FALSE             FALSE             FALSE             FALSE
## 8              FALSE             FALSE             FALSE             FALSE
## 9              FALSE             FALSE             FALSE             FALSE
## 10             FALSE             FALSE             FALSE             FALSE
## 11             FALSE             FALSE             FALSE             FALSE
## 12             FALSE             FALSE             FALSE             FALSE
##    House.StyleSLvl Overall.Qual Overall.Cond Year.Built Year.Remod.Add
## 1            FALSE         TRUE        FALSE      FALSE          FALSE
## 2            FALSE         TRUE        FALSE      FALSE          FALSE
## 3            FALSE         TRUE        FALSE       TRUE          FALSE
## 4            FALSE         TRUE         TRUE       TRUE          FALSE
## 5            FALSE         TRUE         TRUE       TRUE          FALSE
## 6            FALSE         TRUE         TRUE       TRUE          FALSE
## 7            FALSE         TRUE         TRUE       TRUE          FALSE
## 8            FALSE         TRUE         TRUE       TRUE          FALSE
## 9            FALSE         TRUE         TRUE       TRUE          FALSE
## 10           FALSE         TRUE         TRUE       TRUE          FALSE
## 11           FALSE         TRUE         TRUE       TRUE          FALSE
## 12           FALSE         TRUE         TRUE       TRUE          FALSE
##    Roof.StyleGable Roof.StyleGambrel Roof.StyleHip Roof.StyleMansard
## 1            FALSE             FALSE         FALSE             FALSE
## 2            FALSE             FALSE         FALSE             FALSE
## 3            FALSE             FALSE         FALSE             FALSE
## 4            FALSE             FALSE         FALSE             FALSE
## 5            FALSE             FALSE         FALSE             FALSE
## 6            FALSE             FALSE         FALSE             FALSE
## 7            FALSE             FALSE         FALSE             FALSE
## 8            FALSE             FALSE         FALSE             FALSE
## 9            FALSE             FALSE         FALSE             FALSE
## 10           FALSE             FALSE         FALSE             FALSE
## 11           FALSE             FALSE         FALSE             FALSE
## 12           FALSE             FALSE         FALSE             FALSE
##    Roof.StyleShed Roof.MatlCompShg Roof.MatlMembran Roof.MatlMetal
## 1           FALSE            FALSE            FALSE          FALSE
## 2           FALSE            FALSE            FALSE          FALSE
## 3           FALSE            FALSE            FALSE          FALSE
## 4           FALSE            FALSE            FALSE          FALSE
## 5           FALSE            FALSE            FALSE          FALSE
## 6           FALSE            FALSE            FALSE          FALSE
## 7           FALSE            FALSE            FALSE          FALSE
## 8           FALSE            FALSE            FALSE          FALSE
## 9           FALSE            FALSE            FALSE          FALSE
## 10          FALSE            FALSE            FALSE          FALSE
## 11          FALSE            FALSE            FALSE          FALSE
## 12          FALSE            FALSE            FALSE          FALSE
##    Roof.MatlRoll Roof.MatlTar&Grv Roof.MatlWdShake Roof.MatlWdShngl
## 1          FALSE            FALSE            FALSE            FALSE
## 2          FALSE            FALSE            FALSE            FALSE
## 3          FALSE            FALSE            FALSE            FALSE
## 4          FALSE            FALSE            FALSE            FALSE
## 5          FALSE            FALSE            FALSE            FALSE
## 6          FALSE            FALSE            FALSE            FALSE
## 7          FALSE            FALSE            FALSE            FALSE
## 8          FALSE            FALSE            FALSE            FALSE
## 9          FALSE            FALSE            FALSE            FALSE
## 10         FALSE            FALSE            FALSE            FALSE
## 11         FALSE            FALSE            FALSE            FALSE
## 12         FALSE            FALSE            FALSE            FALSE
##    Exterior.1stAsphShn Exterior.1stBrkComm Exterior.1stBrkFace
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    Exterior.1stCBlock Exterior.1stCemntBd Exterior.1stHdBoard
## 1               FALSE               FALSE               FALSE
## 2               FALSE               FALSE               FALSE
## 3               FALSE               FALSE               FALSE
## 4               FALSE               FALSE               FALSE
## 5               FALSE               FALSE               FALSE
## 6               FALSE               FALSE               FALSE
## 7               FALSE               FALSE               FALSE
## 8               FALSE               FALSE               FALSE
## 9               FALSE               FALSE               FALSE
## 10              FALSE               FALSE               FALSE
## 11              FALSE               FALSE               FALSE
## 12              FALSE               FALSE               FALSE
##    Exterior.1stImStucc Exterior.1stMetalSd Exterior.1stPlywood
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    Exterior.1stPreCast Exterior.1stStone Exterior.1stStucco Exterior.1stVinylSd
## 1                FALSE             FALSE              FALSE               FALSE
## 2                FALSE             FALSE              FALSE               FALSE
## 3                FALSE             FALSE              FALSE               FALSE
## 4                FALSE             FALSE              FALSE               FALSE
## 5                FALSE             FALSE              FALSE               FALSE
## 6                FALSE             FALSE              FALSE               FALSE
## 7                FALSE             FALSE              FALSE               FALSE
## 8                FALSE             FALSE              FALSE               FALSE
## 9                FALSE             FALSE              FALSE               FALSE
## 10               FALSE             FALSE              FALSE               FALSE
## 11               FALSE             FALSE              FALSE               FALSE
## 12               FALSE             FALSE              FALSE               FALSE
##    Exterior.1stWd Sdng Exterior.1stWdShing Exterior.2ndAsphShn
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    Exterior.2ndBrk Cmn Exterior.2ndBrkFace Exterior.2ndCBlock
## 1                FALSE               FALSE              FALSE
## 2                FALSE               FALSE              FALSE
## 3                FALSE               FALSE              FALSE
## 4                FALSE               FALSE              FALSE
## 5                FALSE               FALSE              FALSE
## 6                FALSE               FALSE              FALSE
## 7                FALSE               FALSE              FALSE
## 8                FALSE               FALSE              FALSE
## 9                FALSE               FALSE              FALSE
## 10               FALSE               FALSE              FALSE
## 11               FALSE               FALSE              FALSE
## 12               FALSE               FALSE              FALSE
##    Exterior.2ndCmentBd Exterior.2ndHdBoard Exterior.2ndImStucc
## 1                FALSE               FALSE               FALSE
## 2                FALSE               FALSE               FALSE
## 3                FALSE               FALSE               FALSE
## 4                FALSE               FALSE               FALSE
## 5                FALSE               FALSE               FALSE
## 6                FALSE               FALSE               FALSE
## 7                FALSE               FALSE               FALSE
## 8                FALSE               FALSE               FALSE
## 9                FALSE               FALSE               FALSE
## 10               FALSE               FALSE               FALSE
## 11               FALSE               FALSE               FALSE
## 12               FALSE               FALSE               FALSE
##    Exterior.2ndMetalSd Exterior.2ndOther Exterior.2ndPlywood
## 1                FALSE             FALSE               FALSE
## 2                FALSE             FALSE               FALSE
## 3                FALSE             FALSE               FALSE
## 4                FALSE             FALSE               FALSE
## 5                FALSE             FALSE               FALSE
## 6                FALSE             FALSE               FALSE
## 7                FALSE             FALSE               FALSE
## 8                FALSE             FALSE               FALSE
## 9                FALSE             FALSE               FALSE
## 10               FALSE             FALSE               FALSE
## 11               FALSE             FALSE               FALSE
## 12               FALSE             FALSE               FALSE
##    Exterior.2ndPreCast Exterior.2ndStone Exterior.2ndStucco Exterior.2ndVinylSd
## 1                FALSE             FALSE              FALSE               FALSE
## 2                FALSE             FALSE              FALSE               FALSE
## 3                FALSE             FALSE              FALSE               FALSE
## 4                FALSE             FALSE              FALSE               FALSE
## 5                FALSE             FALSE              FALSE               FALSE
## 6                FALSE             FALSE              FALSE               FALSE
## 7                FALSE             FALSE              FALSE               FALSE
## 8                FALSE             FALSE              FALSE               FALSE
## 9                FALSE             FALSE              FALSE               FALSE
## 10               FALSE             FALSE              FALSE               FALSE
## 11               FALSE             FALSE              FALSE               FALSE
## 12               FALSE             FALSE              FALSE               FALSE
##    Exterior.2ndWd Sdng Exterior.2ndWd Shng Mas.Vnr.TypeBrkCmn
## 1                FALSE               FALSE              FALSE
## 2                FALSE               FALSE              FALSE
## 3                FALSE               FALSE              FALSE
## 4                FALSE               FALSE              FALSE
## 5                FALSE               FALSE              FALSE
## 6                FALSE               FALSE              FALSE
## 7                FALSE               FALSE              FALSE
## 8                FALSE               FALSE              FALSE
## 9                FALSE               FALSE              FALSE
## 10               FALSE               FALSE              FALSE
## 11               FALSE               FALSE              FALSE
## 12               FALSE               FALSE              FALSE
##    Mas.Vnr.TypeBrkFace Mas.Vnr.TypeCBlock Mas.Vnr.TypeNone Mas.Vnr.TypeStone
## 1                FALSE              FALSE            FALSE             FALSE
## 2                FALSE              FALSE            FALSE             FALSE
## 3                FALSE              FALSE            FALSE             FALSE
## 4                FALSE              FALSE            FALSE             FALSE
## 5                FALSE              FALSE            FALSE             FALSE
## 6                FALSE              FALSE            FALSE             FALSE
## 7                FALSE              FALSE            FALSE             FALSE
## 8                FALSE              FALSE            FALSE             FALSE
## 9                FALSE              FALSE            FALSE             FALSE
## 10               FALSE              FALSE            FALSE             FALSE
## 11               FALSE              FALSE            FALSE             FALSE
## 12               FALSE              FALSE            FALSE             FALSE
##    Mas.Vnr.Area Exter.QualFa Exter.QualGd Exter.QualTA Exter.CondFa
## 1         FALSE        FALSE        FALSE        FALSE        FALSE
## 2         FALSE        FALSE        FALSE        FALSE        FALSE
## 3         FALSE        FALSE        FALSE        FALSE        FALSE
## 4         FALSE        FALSE        FALSE        FALSE        FALSE
## 5         FALSE        FALSE        FALSE        FALSE        FALSE
## 6         FALSE        FALSE        FALSE        FALSE        FALSE
## 7         FALSE        FALSE        FALSE        FALSE        FALSE
## 8         FALSE        FALSE        FALSE        FALSE        FALSE
## 9         FALSE        FALSE        FALSE        FALSE        FALSE
## 10        FALSE        FALSE        FALSE        FALSE        FALSE
## 11        FALSE        FALSE        FALSE        FALSE        FALSE
## 12        FALSE        FALSE        FALSE        FALSE        FALSE
##    Exter.CondGd Exter.CondPo Exter.CondTA FoundationCBlock FoundationPConc
## 1         FALSE        FALSE        FALSE            FALSE           FALSE
## 2         FALSE        FALSE        FALSE            FALSE           FALSE
## 3         FALSE        FALSE        FALSE            FALSE           FALSE
## 4         FALSE        FALSE        FALSE            FALSE           FALSE
## 5         FALSE        FALSE        FALSE            FALSE           FALSE
## 6         FALSE        FALSE        FALSE            FALSE           FALSE
## 7         FALSE        FALSE        FALSE            FALSE           FALSE
## 8         FALSE        FALSE        FALSE            FALSE           FALSE
## 9         FALSE        FALSE        FALSE            FALSE           FALSE
## 10        FALSE        FALSE        FALSE            FALSE           FALSE
## 11        FALSE        FALSE        FALSE            FALSE           FALSE
## 12        FALSE        FALSE        FALSE            FALSE           FALSE
##    FoundationSlab FoundationStone FoundationWood Total.Bsmt.SF HeatingGasA
## 1           FALSE           FALSE          FALSE         FALSE       FALSE
## 2           FALSE           FALSE          FALSE         FALSE       FALSE
## 3           FALSE           FALSE          FALSE         FALSE       FALSE
## 4           FALSE           FALSE          FALSE         FALSE       FALSE
## 5           FALSE           FALSE          FALSE          TRUE       FALSE
## 6           FALSE           FALSE          FALSE          TRUE       FALSE
## 7           FALSE           FALSE          FALSE          TRUE       FALSE
## 8           FALSE           FALSE          FALSE          TRUE       FALSE
## 9           FALSE           FALSE          FALSE          TRUE       FALSE
## 10          FALSE           FALSE          FALSE          TRUE       FALSE
## 11          FALSE           FALSE          FALSE          TRUE       FALSE
## 12          FALSE           FALSE          FALSE          TRUE       FALSE
##    HeatingGasW HeatingGrav HeatingOthW HeatingWall Heating.QCFa Heating.QCGd
## 1        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 2        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 3        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 4        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 5        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 6        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 7        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 8        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 9        FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 10       FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 11       FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
## 12       FALSE       FALSE       FALSE       FALSE        FALSE        FALSE
##    Heating.QCPo Heating.QCTA Central.AirY ElectricalFuseA ElectricalFuseF
## 1         FALSE        FALSE        FALSE           FALSE           FALSE
## 2         FALSE        FALSE        FALSE           FALSE           FALSE
## 3         FALSE        FALSE        FALSE           FALSE           FALSE
## 4         FALSE        FALSE        FALSE           FALSE           FALSE
## 5         FALSE        FALSE        FALSE           FALSE           FALSE
## 6         FALSE        FALSE        FALSE           FALSE           FALSE
## 7         FALSE        FALSE        FALSE           FALSE           FALSE
## 8         FALSE        FALSE        FALSE           FALSE           FALSE
## 9         FALSE        FALSE        FALSE           FALSE           FALSE
## 10        FALSE        FALSE        FALSE           FALSE           FALSE
## 11        FALSE        FALSE        FALSE           FALSE           FALSE
## 12        FALSE        FALSE        FALSE           FALSE           FALSE
##    ElectricalFuseP ElectricalMix ElectricalSBrkr X1st.Flr.SF X2nd.Flr.SF
## 1            FALSE         FALSE           FALSE       FALSE       FALSE
## 2            FALSE         FALSE           FALSE       FALSE       FALSE
## 3            FALSE         FALSE           FALSE       FALSE       FALSE
## 4            FALSE         FALSE           FALSE       FALSE       FALSE
## 5            FALSE         FALSE           FALSE       FALSE       FALSE
## 6            FALSE         FALSE           FALSE       FALSE       FALSE
## 7            FALSE         FALSE           FALSE       FALSE       FALSE
## 8            FALSE         FALSE           FALSE       FALSE       FALSE
## 9            FALSE         FALSE           FALSE       FALSE       FALSE
## 10           FALSE         FALSE           FALSE       FALSE       FALSE
## 11           FALSE         FALSE           FALSE       FALSE       FALSE
## 12           FALSE         FALSE           FALSE       FALSE       FALSE
##    Low.Qual.Fin.SF Gr.Liv.Area Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr
## 1            FALSE       FALSE     FALSE     FALSE         FALSE         FALSE
## 2            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 3            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 4            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 5            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 6            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 7            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 8            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 9            FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 10           FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 11           FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
## 12           FALSE        TRUE     FALSE     FALSE         FALSE         FALSE
##    Kitchen.QualFa Kitchen.QualGd Kitchen.QualPo Kitchen.QualTA TotRms.AbvGrd
## 1           FALSE          FALSE          FALSE          FALSE         FALSE
## 2           FALSE          FALSE          FALSE          FALSE         FALSE
## 3           FALSE          FALSE          FALSE          FALSE         FALSE
## 4           FALSE          FALSE          FALSE          FALSE         FALSE
## 5           FALSE          FALSE          FALSE          FALSE         FALSE
## 6           FALSE          FALSE          FALSE          FALSE         FALSE
## 7           FALSE          FALSE          FALSE          FALSE         FALSE
## 8           FALSE          FALSE          FALSE          FALSE         FALSE
## 9           FALSE          FALSE          FALSE          FALSE         FALSE
## 10          FALSE          FALSE          FALSE          FALSE         FALSE
## 11          FALSE          FALSE          FALSE          FALSE         FALSE
## 12          FALSE          FALSE          FALSE          FALSE         FALSE
##    FunctionalMaj2 FunctionalMin1 FunctionalMin2 FunctionalMod FunctionalSal
## 1           FALSE          FALSE          FALSE         FALSE         FALSE
## 2           FALSE          FALSE          FALSE         FALSE         FALSE
## 3           FALSE          FALSE          FALSE         FALSE         FALSE
## 4           FALSE          FALSE          FALSE         FALSE         FALSE
## 5           FALSE          FALSE          FALSE         FALSE         FALSE
## 6           FALSE          FALSE          FALSE         FALSE         FALSE
## 7           FALSE          FALSE          FALSE         FALSE         FALSE
## 8           FALSE          FALSE          FALSE         FALSE         FALSE
## 9           FALSE          FALSE          FALSE         FALSE          TRUE
## 10          FALSE          FALSE          FALSE         FALSE          TRUE
## 11          FALSE          FALSE          FALSE         FALSE          TRUE
## 12          FALSE          FALSE          FALSE         FALSE          TRUE
##    FunctionalSev FunctionalTyp Fireplaces Garage.TypeAttchd Garage.TypeBasment
## 1          FALSE         FALSE      FALSE             FALSE              FALSE
## 2          FALSE         FALSE      FALSE             FALSE              FALSE
## 3          FALSE         FALSE      FALSE             FALSE              FALSE
## 4          FALSE         FALSE      FALSE             FALSE              FALSE
## 5          FALSE         FALSE      FALSE             FALSE              FALSE
## 6          FALSE         FALSE      FALSE             FALSE              FALSE
## 7          FALSE         FALSE      FALSE             FALSE              FALSE
## 8          FALSE         FALSE      FALSE             FALSE              FALSE
## 9          FALSE         FALSE      FALSE             FALSE              FALSE
## 10         FALSE         FALSE       TRUE             FALSE              FALSE
## 11         FALSE         FALSE       TRUE             FALSE              FALSE
## 12         FALSE         FALSE       TRUE             FALSE              FALSE
##    Garage.TypeBuiltIn Garage.TypeCarPort Garage.TypeDetchd Garage.Yr.Blt
## 1               FALSE              FALSE             FALSE         FALSE
## 2               FALSE              FALSE             FALSE         FALSE
## 3               FALSE              FALSE             FALSE         FALSE
## 4               FALSE              FALSE             FALSE         FALSE
## 5               FALSE              FALSE             FALSE         FALSE
## 6               FALSE              FALSE             FALSE         FALSE
## 7               FALSE              FALSE             FALSE         FALSE
## 8               FALSE              FALSE             FALSE         FALSE
## 9               FALSE              FALSE             FALSE         FALSE
## 10              FALSE              FALSE             FALSE         FALSE
## 11              FALSE              FALSE             FALSE         FALSE
## 12              FALSE              FALSE             FALSE         FALSE
##    Garage.FinishFin Garage.FinishRFn Garage.FinishUnf Garage.Cars Garage.Area
## 1             FALSE            FALSE            FALSE       FALSE       FALSE
## 2             FALSE            FALSE            FALSE       FALSE       FALSE
## 3             FALSE            FALSE            FALSE       FALSE       FALSE
## 4             FALSE            FALSE            FALSE       FALSE       FALSE
## 5             FALSE            FALSE            FALSE       FALSE       FALSE
## 6             FALSE            FALSE            FALSE        TRUE       FALSE
## 7             FALSE            FALSE            FALSE        TRUE       FALSE
## 8             FALSE            FALSE            FALSE        TRUE       FALSE
## 9             FALSE            FALSE            FALSE        TRUE       FALSE
## 10            FALSE            FALSE            FALSE        TRUE       FALSE
## 11            FALSE            FALSE            FALSE        TRUE       FALSE
## 12            FALSE            FALSE            FALSE        TRUE       FALSE
##    Garage.QualEx Garage.QualFa Garage.QualGd Garage.QualPo Garage.QualTA
## 1          FALSE         FALSE         FALSE         FALSE         FALSE
## 2          FALSE         FALSE         FALSE         FALSE         FALSE
## 3          FALSE         FALSE         FALSE         FALSE         FALSE
## 4          FALSE         FALSE         FALSE         FALSE         FALSE
## 5          FALSE         FALSE         FALSE         FALSE         FALSE
## 6          FALSE         FALSE         FALSE         FALSE         FALSE
## 7          FALSE         FALSE         FALSE         FALSE         FALSE
## 8          FALSE         FALSE         FALSE         FALSE         FALSE
## 9          FALSE         FALSE         FALSE         FALSE         FALSE
## 10         FALSE         FALSE         FALSE         FALSE         FALSE
## 11         FALSE         FALSE         FALSE         FALSE         FALSE
## 12         FALSE         FALSE         FALSE         FALSE         FALSE
##    Garage.CondEx Garage.CondFa Garage.CondGd Garage.CondPo Garage.CondTA
## 1          FALSE         FALSE         FALSE         FALSE         FALSE
## 2          FALSE         FALSE         FALSE         FALSE         FALSE
## 3          FALSE         FALSE         FALSE         FALSE         FALSE
## 4          FALSE         FALSE         FALSE         FALSE         FALSE
## 5          FALSE         FALSE         FALSE         FALSE         FALSE
## 6          FALSE         FALSE         FALSE         FALSE         FALSE
## 7          FALSE         FALSE         FALSE         FALSE         FALSE
## 8          FALSE         FALSE         FALSE         FALSE         FALSE
## 9          FALSE         FALSE         FALSE         FALSE         FALSE
## 10         FALSE         FALSE         FALSE         FALSE         FALSE
## 11         FALSE         FALSE         FALSE         FALSE         FALSE
## 12         FALSE         FALSE         FALSE         FALSE         FALSE
##    Paved.DriveP Paved.DriveY Wood.Deck.SF Open.Porch.SF Enclosed.Porch
## 1         FALSE        FALSE        FALSE         FALSE          FALSE
## 2         FALSE        FALSE        FALSE         FALSE          FALSE
## 3         FALSE        FALSE        FALSE         FALSE          FALSE
## 4         FALSE        FALSE        FALSE         FALSE          FALSE
## 5         FALSE        FALSE        FALSE         FALSE          FALSE
## 6         FALSE        FALSE        FALSE         FALSE          FALSE
## 7         FALSE        FALSE        FALSE         FALSE          FALSE
## 8         FALSE        FALSE        FALSE         FALSE          FALSE
## 9         FALSE        FALSE        FALSE         FALSE          FALSE
## 10        FALSE        FALSE        FALSE         FALSE          FALSE
## 11        FALSE        FALSE        FALSE         FALSE          FALSE
## 12        FALSE        FALSE        FALSE         FALSE          FALSE
##    X3Ssn.Porch Screen.Porch Pool.Area Misc.Val Mo.Sold Yr.Sold Sale.TypeCon
## 1        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 2        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 3        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 4        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 5        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 6        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 7        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 8        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 9        FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 10       FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 11       FALSE        FALSE     FALSE    FALSE   FALSE   FALSE        FALSE
## 12       FALSE        FALSE     FALSE     TRUE   FALSE   FALSE        FALSE
##    Sale.TypeConLD Sale.TypeConLI Sale.TypeConLw Sale.TypeCWD Sale.TypeNew
## 1           FALSE          FALSE          FALSE        FALSE        FALSE
## 2           FALSE          FALSE          FALSE        FALSE        FALSE
## 3           FALSE          FALSE          FALSE        FALSE        FALSE
## 4           FALSE          FALSE          FALSE        FALSE        FALSE
## 5           FALSE          FALSE          FALSE        FALSE        FALSE
## 6           FALSE          FALSE          FALSE        FALSE        FALSE
## 7           FALSE          FALSE          FALSE        FALSE        FALSE
## 8           FALSE          FALSE          FALSE        FALSE        FALSE
## 9           FALSE          FALSE          FALSE        FALSE        FALSE
## 10          FALSE          FALSE          FALSE        FALSE        FALSE
## 11          FALSE          FALSE          FALSE        FALSE        FALSE
## 12          FALSE          FALSE          FALSE        FALSE        FALSE
##    Sale.TypeOth Sale.TypeVWD Sale.TypeWD  Sale.ConditionAdjLand
## 1         FALSE        FALSE        FALSE                 FALSE
## 2         FALSE        FALSE        FALSE                 FALSE
## 3         FALSE        FALSE        FALSE                 FALSE
## 4         FALSE        FALSE        FALSE                 FALSE
## 5         FALSE        FALSE        FALSE                 FALSE
## 6         FALSE        FALSE        FALSE                 FALSE
## 7         FALSE        FALSE        FALSE                 FALSE
## 8         FALSE        FALSE        FALSE                 FALSE
## 9         FALSE        FALSE        FALSE                 FALSE
## 10        FALSE        FALSE        FALSE                 FALSE
## 11        FALSE        FALSE        FALSE                 FALSE
## 12        FALSE        FALSE        FALSE                 FALSE
##    Sale.ConditionAlloca Sale.ConditionFamily Sale.ConditionNormal
## 1                 FALSE                FALSE                FALSE
## 2                 FALSE                FALSE                FALSE
## 3                 FALSE                FALSE                FALSE
## 4                 FALSE                FALSE                FALSE
## 5                 FALSE                FALSE                FALSE
## 6                 FALSE                FALSE                FALSE
## 7                 FALSE                FALSE                FALSE
## 8                 FALSE                FALSE                FALSE
## 9                 FALSE                FALSE                FALSE
## 10                FALSE                FALSE                FALSE
## 11                FALSE                FALSE                FALSE
## 12                FALSE                FALSE                FALSE
##    Sale.ConditionPartial
## 1                  FALSE
## 2                  FALSE
## 3                  FALSE
## 4                  FALSE
## 5                  FALSE
## 6                  FALSE
## 7                  FALSE
## 8                  FALSE
## 9                  FALSE
## 10                 FALSE
## 11                 FALSE
## 12                 FALSE

Based on step wise selection, these are the features containing in the model: Gr.Liv.Area, Overall.Qual, Overall.Cond, Year.Built, Total.Bsmt.SF, Fireplaces, Garage.Cars. Some categorical variables with TRUE in one of the sub-categories are MS.Zoning, Neighborhood, Bldg.Type, Functional

5.3.2 Testing the effects all potential features together

The function drop1() gives an overview about the comparison of original model to the model where one explanatory variable was omitted.

# Fitting all potential features together
lm_fit_focus <- lm(log(SalePrice) ~ Gr.Liv.Area + MS.Zoning + Neighborhood + 
                      Overall.Qual + Overall.Cond + Bldg.Type + Year.Built + 
                      Foundation + Total.Bsmt.SF + Fireplaces + Functional + 
                      House.Style + Garage.Cars + Sale.Condition, 
                    data = d_house)

# Testing all potential features together
drop1(lm_fit_focus, test = "F")
## Single term deletions
## 
## Model:
## log(SalePrice) ~ Gr.Liv.Area + MS.Zoning + Neighborhood + Overall.Qual + 
##     Overall.Cond + Bldg.Type + Year.Built + Foundation + Total.Bsmt.SF + 
##     Fireplaces + Functional + House.Style + Garage.Cars + Sale.Condition
##                Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>                      52.404 -11642                       
## Gr.Liv.Area     1   11.8048 64.209 -11049 644.0319 < 2.2e-16 ***
## MS.Zoning       6    0.9434 53.348 -11601   8.5785 2.896e-09 ***
## Neighborhood   27    7.0510 59.455 -11326  14.2475 < 2.2e-16 ***
## Overall.Qual    1    6.9682 59.372 -11278 380.1607 < 2.2e-16 ***
## Overall.Cond    1    7.2569 59.661 -11264 395.9136 < 2.2e-16 ***
## Bldg.Type       4    1.7961 54.200 -11551  24.4967 < 2.2e-16 ***
## Year.Built      1    3.0252 55.429 -11479 165.0451 < 2.2e-16 ***
## Foundation      5    0.5758 52.980 -11620   6.2822 8.337e-06 ***
## Total.Bsmt.SF   1    0.9273 53.331 -11592  50.5908 1.432e-12 ***
## Fireplaces      1    1.5554 53.960 -11558  84.8564 < 2.2e-16 ***
## Functional      7    1.3865 53.791 -11579  10.8061 1.631e-13 ***
## House.Style     7    0.6338 53.038 -11620   4.9400 1.448e-05 ***
## Garage.Cars     1    2.6991 55.103 -11497 147.2547 < 2.2e-16 ***
## Sale.Condition  5    1.3540 53.758 -11577  14.7736 2.469e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Summary of the model 
summary(lm_fit_focus)
## 
## Call:
## lm(formula = log(SalePrice) ~ Gr.Liv.Area + MS.Zoning + Neighborhood + 
##     Overall.Qual + Overall.Cond + Bldg.Type + Year.Built + Foundation + 
##     Total.Bsmt.SF + Fireplaces + Functional + House.Style + Garage.Cars + 
##     Sale.Condition, data = d_house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98444 -0.06388  0.00055  0.06949  0.54778 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.355e+00  4.596e-01   9.475  < 2e-16 ***
## Gr.Liv.Area            2.728e-04  1.075e-05  25.378  < 2e-16 ***
## MS.ZoningC (all)       2.378e-01  1.173e-01   2.027 0.042773 *  
## MS.ZoningFV            4.593e-01  1.155e-01   3.978 7.12e-05 ***
## MS.ZoningI (all)       4.719e-01  1.511e-01   3.122 0.001812 ** 
## MS.ZoningRH            4.301e-01  1.163e-01   3.697 0.000222 ***
## MS.ZoningRL            4.402e-01  1.130e-01   3.897 9.97e-05 ***
## MS.ZoningRM            4.093e-01  1.134e-01   3.610 0.000312 ***
## NeighborhoodBlueste   -3.621e-03  5.340e-02  -0.068 0.945936    
## NeighborhoodBrDale    -3.259e-02  4.272e-02  -0.763 0.445692    
## NeighborhoodBrkSide   -2.495e-03  3.547e-02  -0.070 0.943919    
## NeighborhoodClearCr    1.274e-01  3.542e-02   3.596 0.000328 ***
## NeighborhoodCollgCr    3.359e-02  2.918e-02   1.151 0.249789    
## NeighborhoodCrawfor    1.454e-01  3.225e-02   4.509 6.78e-06 ***
## NeighborhoodEdwards   -5.345e-02  3.105e-02  -1.722 0.085262 .  
## NeighborhoodGilbert   -6.650e-03  3.029e-02  -0.220 0.826251    
## NeighborhoodGreens     1.339e-01  5.614e-02   2.385 0.017121 *  
## NeighborhoodGrnHill    5.854e-01  1.004e-01   5.832 6.10e-09 ***
## NeighborhoodIDOTRR    -4.097e-02  3.902e-02  -1.050 0.293812    
## NeighborhoodLandmrk    2.845e-02  1.420e-01   0.200 0.841294    
## NeighborhoodMeadowV   -9.833e-02  3.976e-02  -2.473 0.013455 *  
## NeighborhoodMitchel    1.492e-02  3.134e-02   0.476 0.634130    
## NeighborhoodNAmes      5.547e-03  3.018e-02   0.184 0.854180    
## NeighborhoodNoRidge    1.073e-01  3.281e-02   3.272 0.001081 ** 
## NeighborhoodNPkVill   -1.051e-02  4.028e-02  -0.261 0.794157    
## NeighborhoodNridgHt    1.606e-01  2.897e-02   5.543 3.24e-08 ***
## NeighborhoodNWAmes    -1.722e-02  3.118e-02  -0.552 0.580831    
## NeighborhoodOldTown   -5.113e-02  3.584e-02  -1.427 0.153822    
## NeighborhoodSawyer    -3.345e-04  3.151e-02  -0.011 0.991530    
## NeighborhoodSawyerW    8.425e-04  3.053e-02   0.028 0.977986    
## NeighborhoodSomerst    7.382e-02  3.472e-02   2.126 0.033569 *  
## NeighborhoodStoneBr    1.913e-01  3.263e-02   5.862 5.09e-09 ***
## NeighborhoodSWISU     -3.927e-03  3.727e-02  -0.105 0.916107    
## NeighborhoodTimber     9.071e-02  3.208e-02   2.828 0.004718 ** 
## NeighborhoodVeenker    8.291e-02  3.924e-02   2.113 0.034696 *  
## Overall.Qual           6.800e-02  3.488e-03  19.498  < 2e-16 ***
## Overall.Cond           5.353e-02  2.690e-03  19.898  < 2e-16 ***
## Bldg.Type2fmCon       -8.760e-04  1.826e-02  -0.048 0.961735    
## Bldg.TypeDuplex       -1.059e-01  1.542e-02  -6.870 7.87e-12 ***
## Bldg.TypeTwnhs        -1.359e-01  1.996e-02  -6.810 1.18e-11 ***
## Bldg.TypeTwnhsE       -6.155e-02  1.290e-02  -4.771 1.93e-06 ***
## Year.Built             2.887e-03  2.247e-04  12.847  < 2e-16 ***
## FoundationCBlock       2.703e-02  1.134e-02   2.383 0.017231 *  
## FoundationPConc        5.145e-02  1.284e-02   4.008 6.28e-05 ***
## FoundationSlab        -4.608e-02  2.598e-02  -1.773 0.076269 .  
## FoundationStone        9.615e-02  4.280e-02   2.246 0.024758 *  
## FoundationWood        -1.760e-02  6.258e-02  -0.281 0.778543    
## Total.Bsmt.SF          7.883e-05  1.108e-05   7.113 1.43e-12 ***
## Fireplaces             4.488e-02  4.872e-03   9.212  < 2e-16 ***
## FunctionalMaj2        -8.162e-02  5.617e-02  -1.453 0.146351    
## FunctionalMin1         7.524e-02  3.652e-02   2.060 0.039449 *  
## FunctionalMin2         6.411e-02  3.634e-02   1.764 0.077850 .  
## FunctionalMod          6.491e-02  3.990e-02   1.627 0.103864    
## FunctionalSal         -6.887e-01  1.180e-01  -5.837 5.91e-09 ***
## FunctionalSev         -2.046e-01  1.021e-01  -2.004 0.045198 *  
## FunctionalTyp          8.787e-02  3.264e-02   2.692 0.007146 ** 
## House.Style1.5Unf     -3.987e-02  3.289e-02  -1.212 0.225625    
## House.Style1Story      1.052e-02  1.105e-02   0.952 0.341269    
## House.Style2.5Fin     -4.609e-02  5.120e-02  -0.900 0.368066    
## House.Style2.5Unf      1.475e-02  3.004e-02   0.491 0.623525    
## House.Style2Story     -2.900e-02  1.093e-02  -2.652 0.008049 ** 
## House.StyleSFoyer      6.858e-02  1.919e-02   3.573 0.000358 ***
## House.StyleSLvl        7.619e-03  1.560e-02   0.488 0.625249    
## Garage.Cars            5.773e-02  4.757e-03  12.135  < 2e-16 ***
## Sale.ConditionAdjLand  1.377e-01  4.170e-02   3.303 0.000969 ***
## Sale.ConditionAlloca   9.742e-02  3.175e-02   3.068 0.002175 ** 
## Sale.ConditionFamily   4.257e-02  2.253e-02   1.890 0.058877 .  
## Sale.ConditionNormal   8.553e-02  1.063e-02   8.044 1.26e-15 ***
## Sale.ConditionPartial  1.002e-01  1.464e-02   6.843 9.45e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1354 on 2859 degrees of freedom
## Multiple R-squared:  0.8922, Adjusted R-squared:  0.8896 
## F-statistic: 347.9 on 68 and 2859 DF,  p-value: < 2.2e-16

5.3.3 Predictor summary

To sum up, we started with intuition analysis, numerical variable, then categorical variable analysis. Afterward, the step wise selection were conducted. Next, the drop1 function to test the relevant effect of different variables on the response variables. Below is the finding summary of variables, which we focus from different methods:

6 Train and test data

The data are divided into 70% train data and 30% test data because there are 2930 observations, which are large enough for that ratio. Moreover, we use function createDataPartition(), which helps to have more random split of the data.

From the scatter plots in the numerical analysis part, there are some outliers, which do not follow the trend in the Gr.Liv.Area and Total.Bsmt.SF. Thus, we remove them first before fitting the model because keeping them will make the errors become extremely high and there are only 3 outliers.

set.seed(12)
# We choose 70% for train data
indices <- createDataPartition(d_house$SalePrice, p = 0.70, list = F)

# Remove outliers from Gr.Liv.Area & Total.Bmst.SF & Garage.Cars
d_house <- d_house %>%
  filter(Gr.Liv.Area < 4500 & Total.Bsmt.SF < 4500)

# Train data
d_train <- d_house %>%
  slice(indices)

# Test data
d_test <- d_house %>%
  slice(-indices) 

7 Linear model

7.1 Fitting several models

To choose the best linear model, we fit several models on the train data and check the errors on the test data. At the end, we chose the best result for linear model. The fitting model process, we will use out sample cross validation, meaning that fitting model with train data and predict on the test data:

# Model based on stepwise variables
lm1_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Overall.Cond + 
                  Year.Built + Total.Bsmt.SF + Fireplaces + Garage.Cars + 
                  MS.Zoning + Bldg.Type + Functional, 
                data = d_train)
# Predict the model on test data
pred_lm1_test <- predict(lm1_train, newdata = d_test)

# R^2
r_squared_lm1 <- cor(exp(pred_lm1_test), d_test$SalePrice)^2
# RMSE
## Way 1: manually
sqrt(mean((exp(pred_lm1_test) - d_test$SalePrice)^2))
## [1] 25342.04
## Way 2: by function
accuracy(exp(pred_lm1_test), d_test$SalePrice)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 177.2329 25342.04 17914.25 -1.516647 10.26084
# more info about the model
broom::glance(lm1_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.904         0.902 0.128      791.       0    24  1321. -2590. -2444.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# More simple model 
lm2_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Year.Built + Overall.Qual + 
                  MS.Zoning, data = d_train)
# Predict the model on test data
pred_lm2_test <- predict(lm2_train, newdata = d_test)
# R^2
r_squared_lm2 <- cor(exp(pred_lm2_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm2_test), d_test$SalePrice)
##                ME    RMSE      MAE       MPE     MAPE
## Test set 1051.103 31052.2 22221.67 -2.483714 13.17617
# more info about the model
broom::glance(lm2_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.836         0.835 0.166     1154.       0     9   775. -1528. -1466.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model  
lm3_train <- lm(log(SalePrice) ~ Gr.Liv.Area + MS.Zoning +
                      Overall.Qual + Bldg.Type + Year.Built + Foundation + 
                      Total.Bsmt.SF + House.Style + Garage.Cars + Sale.Condition,
                na.action=na.omit,
                data = d_train)
# Predict the model on test data
pred_lm3_test <- predict(lm3_train, newdata = d_test)
# R^2
r_squared_lm3 <- cor(exp(pred_lm3_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm3_test), d_test$SalePrice)
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -182.8621 25619.09 18488.89 -2.107878 11.01322
# more info about the model
broom::glance(lm3_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.883         0.881 0.141      477.       0    32  1124. -2181. -1989.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model 
lm4_train <- lm(log(SalePrice) ~ Gr.Liv.Area +
                      Overall.Qual + Bldg.Type + Year.Built +
                      Garage.Cars + Sale.Condition,
                na.action=na.omit,
                data = d_train)
# Predict the model on test data
pred_lm4_test <- predict(lm4_train, newdata = d_test)
# R^2
r_squared_lm4 <- cor(exp(pred_lm4_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm4_test), d_test$SalePrice)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 790.6838 29495.33 21155.17 -2.117575 12.44964
# more info about the model
broom::glance(lm4_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.846         0.845 0.161      861.       0    13   841. -1653. -1568.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model with intuition variables
lm5_train <- lm(log(SalePrice) ~ Overall.Qual + Year.Built + Overall.Cond + 
                  MS.Zoning + Bldg.Type + Sale.Condition, 
                data = d_train)

# Predict the model on test data
pred_lm5_test <- predict(lm5_train, newdata = d_test)
# R^2
r_squared_lm5 <- cor(exp(pred_lm5_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm5_test), d_test$SalePrice)
##                ME     RMSE      MAE       MPE    MAPE
## Test set 4744.943 42136.58 28075.73 -1.688136 15.4458
# more info about the model
broom::glance(lm5_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.769         0.767 0.198      376.       0    18   425. -809. -697.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model with high correlated variables
lm6_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Year.Built + 
                  Full.Bath, 
                data = d_train)

# Predict the model on test data
pred_lm6_test <- predict(lm6_train, newdata = d_test)
# R^2
r_squared_lm6 <- cor(exp(pred_lm6_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm6_test), d_test$SalePrice)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 558.1429 31714.33 22918.25 -2.895045 13.81076
# more info about the model
broom::glance(lm5_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.769         0.767 0.198      376.       0    18   425. -809. -697.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model with variables appearing in several analysis
lm7_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Year.Built + 
                  Total.Bsmt.SF + Garage.Cars + MS.Zoning + Bldg.Type + 
                  Functional, 
                data = d_train)
# Predict the model on test data
pred_lm7_test <- predict(lm7_train, newdata = d_test)
# R^2
r_squared_lm7 <- cor(exp(pred_lm7_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm7_test), d_test$SalePrice)
##                ME     RMSE     MAE       MPE     MAPE
## Test set 612.2026 26221.26 18767.6 -1.870345 11.14854
# more info about the model
broom::glance(lm7_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.882         0.881 0.141      688.       0    22  1113. -2178. -2043.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model altered
lm8_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Year.Built  + 
                  Garage.Cars + MS.Zoning + Bldg.Type + Functional, 
                data = d_train)
# Predict the model on test data
pred_lm8_test <- predict(lm8_train, newdata = d_test)
# R^2
r_squared_lm8 <- cor(exp(pred_lm8_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm8_test), d_test$SalePrice)
##               ME     RMSE     MAE      MPE    MAPE
## Test set 1433.44 29515.96 20865.3 -1.82409 12.1309
# more info about the model
broom::glance(lm8_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.862         0.860 0.153      601.       0    21   950. -1854. -1724.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model altered
lm9_train <- lm(log(SalePrice) ~ Lot.Area + Gr.Liv.Area + Overall.Qual + Year.Built +
                  Overall.Cond + Bldg.Type + Fireplaces + Functional +
                  Total.Bsmt.SF + Garage.Area, 
                data = d_train)
# Predict the model on test data
pred_lm9_test <- predict(lm9_train, newdata = d_test)
# R^2
r_squared_lm9 <- cor(exp(pred_lm9_test), d_test$SalePrice)^2
# RMSE
accuracy(exp(pred_lm9_test), d_test$SalePrice)
##                 ME     RMSE      MAE       MPE  MAPE
## Test set -379.6749 25211.47 17442.98 -1.701891 10.12
# more info about the model
broom::glance(lm9_train)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.901         0.900 0.129      977.       0    19  1298. -2555. -2437.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

7.2 The final linear model

After fitting several models above, the final model is as below. In this part, we first fit the model

# Fit the final model with the whole dataset
lm_final <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Overall.Cond + 
                  Year.Built + Total.Bsmt.SF + Fireplaces + Garage.Cars + 
                  MS.Zoning + Bldg.Type + Functional, 
                data = d_house)
# Predict the model 1 on test data
summary(lm_final)
## 
## Call:
## lm(formula = log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Overall.Cond + 
##     Year.Built + Total.Bsmt.SF + Fireplaces + Garage.Cars + MS.Zoning + 
##     Bldg.Type + Functional, data = d_house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61466 -0.07242  0.00078  0.07467  0.61479 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.489e+00  2.864e-01  12.181  < 2e-16 ***
## Gr.Liv.Area       2.874e-04  7.053e-06  40.742  < 2e-16 ***
## Overall.Qual      7.778e-02  3.001e-03  25.919  < 2e-16 ***
## Overall.Cond      5.650e-02  2.500e-03  22.598  < 2e-16 ***
## Year.Built        3.337e-03  1.328e-04  25.134  < 2e-16 ***
## Total.Bsmt.SF     1.792e-04  7.282e-06  24.606  < 2e-16 ***
## Fireplaces        4.576e-02  4.441e-03  10.305  < 2e-16 ***
## Garage.Cars       5.439e-02  4.400e-03  12.360  < 2e-16 ***
## MS.ZoningC (all)  1.113e-01  1.114e-01   0.999 0.317796    
## MS.ZoningFV       3.639e-01  1.090e-01   3.338 0.000854 ***
## MS.ZoningI (all)  4.446e-01  1.439e-01   3.090 0.002018 ** 
## MS.ZoningRH       2.933e-01  1.113e-01   2.635 0.008455 ** 
## MS.ZoningRL       3.290e-01  1.084e-01   3.036 0.002419 ** 
## MS.ZoningRM       2.607e-01  1.085e-01   2.403 0.016305 *  
## Bldg.Type2fmCon  -1.032e-02  1.734e-02  -0.595 0.551764    
## Bldg.TypeDuplex  -1.206e-01  1.345e-02  -8.970  < 2e-16 ***
## Bldg.TypeTwnhs   -1.188e-01  1.435e-02  -8.280  < 2e-16 ***
## Bldg.TypeTwnhsE  -1.997e-02  1.006e-02  -1.985 0.047237 *  
## FunctionalMaj2   -2.605e-02  5.348e-02  -0.487 0.626146    
## FunctionalMin1    1.042e-01  3.464e-02   3.009 0.002642 ** 
## FunctionalMin2    1.074e-01  3.450e-02   3.113 0.001867 ** 
## FunctionalMod     9.151e-02  3.793e-02   2.413 0.015902 *  
## FunctionalSal    -7.461e-01  1.115e-01  -6.691 2.66e-11 ***
## FunctionalSev    -3.106e-01  9.775e-02  -3.177 0.001502 ** 
## FunctionalTyp     1.207e-01  3.101e-02   3.891 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1307 on 2900 degrees of freedom
## Multiple R-squared:  0.8981, Adjusted R-squared:  0.8972 
## F-statistic:  1064 on 24 and 2900 DF,  p-value: < 2.2e-16
# texreg::texreg(lm_final)
# texreg::htmlreg(lm_final)

7.2.1 Estimated coefficients

Let’s look at the estimated regression coefficients.

# All coefficients
coef(lm_final)
##      (Intercept)      Gr.Liv.Area     Overall.Qual     Overall.Cond 
##     3.4891896552     0.0002873522     0.0777771151     0.0564954917 
##       Year.Built    Total.Bsmt.SF       Fireplaces      Garage.Cars 
##     0.0033368835     0.0001791800     0.0457646255     0.0543900218 
## MS.ZoningC (all)      MS.ZoningFV MS.ZoningI (all)      MS.ZoningRH 
##     0.1113398496     0.3638808679     0.4445522047     0.2932750207 
##      MS.ZoningRL      MS.ZoningRM  Bldg.Type2fmCon  Bldg.TypeDuplex 
##     0.3289933119     0.2606826556    -0.0103184153    -0.1206254458 
##   Bldg.TypeTwnhs  Bldg.TypeTwnhsE   FunctionalMaj2   FunctionalMin1 
##    -0.1188132690    -0.0199651960    -0.0260542727     0.1042372769 
##   FunctionalMin2    FunctionalMod    FunctionalSal    FunctionalSev 
##     0.1074237350     0.0915074060    -0.7460983198    -0.3105731682 
##    FunctionalTyp 
##     0.1206547519
## Interpreting the slope
exp(lm_final$coefficients["(Intercept)"])
## (Intercept) 
##    32.75939
## interpreting overall quality slope
(exp(lm_final$coefficients["Overall.Qual"])-1)*100
## Overall.Qual 
##     8.088172
## building type "Duplex"
(lm_final$coefficients["Bldg.TypeDuplex"])
## Bldg.TypeDuplex 
##      -0.1206254

Interpretation

  • The intercept: When every predictors is zero, the house price is 32.7593905 (extrapolation)
  • The slope of Overall.Qual: For every one-scale increase in the overall quality (assuming all other predictors remain constant), the house price increases by about 8.088172%.
  • Categorical variable Bldg.Type: Here, the 1Fam in the building type is used as the reference level. The -0.1206254 of Duplex is the difference of the intercept to 1Fam.

If we are interested in estimating the magnitude of an effect and its uncertainty, let’s look at the confident intervals for the lm_final model.

confint(lm_final)
##                          2.5 %        97.5 %
## (Intercept)       2.9275485417  4.0508307687
## Gr.Liv.Area       0.0002735229  0.0003011815
## Overall.Qual      0.0718932356  0.0836609947
## Overall.Cond      0.0515935831  0.0613974003
## Year.Built        0.0030765613  0.0035972058
## Total.Bsmt.SF     0.0001649018  0.0001934582
## Fireplaces        0.0370565741  0.0544726769
## Garage.Cars       0.0457616371  0.0630184064
## MS.ZoningC (all) -0.1071547868  0.3298344860
## MS.ZoningFV       0.1501495438  0.5776121920
## MS.ZoningI (all)  0.1624898015  0.7266146078
## MS.ZoningRH       0.0750532782  0.5114967631
## MS.ZoningRL       0.1165124965  0.5414741272
## MS.ZoningRM       0.0480105093  0.4733548020
## Bldg.Type2fmCon  -0.0443112778  0.0236744472
## Bldg.TypeDuplex  -0.1469942802 -0.0942566115
## Bldg.TypeTwnhs   -0.1469504384 -0.0906760996
## Bldg.TypeTwnhsE  -0.0396866892 -0.0002437028
## FunctionalMaj2   -0.1309097432  0.0788011979
## FunctionalMin1    0.0363176452  0.1721569087
## FunctionalMin2    0.0397716598  0.1750758102
## FunctionalMod     0.0171360309  0.1658787811
## FunctionalSal    -0.9647556503 -0.5274409892
## FunctionalSev    -0.5022309701 -0.1189153663
## FunctionalTyp     0.0598569079  0.1814525959

This final model were chosen based on the lowest RMSE and the highest adjusted R squared from the previous part. Notably, the adjusted R-squared to compare the goodness-of-fit for regression models that contain differing numbers of independent variables. That’s why we use this to evaluate the models.

7.2.2 Prediction

It’s important to examine visually the prediction errors of the model. We present predicted value in the original unit of the house price by applying the function exp().

# Make predictions from the model, exponential back the value
d_house_lm <- d_house %>% 
  mutate(pred_lm = exp(predict(lm_final)))

# Plot the predictions (on x axis) versus the house price
ggplot(d_house_lm, aes(x = pred_lm, y = SalePrice)) + 
  geom_point() + 
  geom_abline() +
  labs(title = "The actual Sale Price and the predicted Sale Price from linear model") +
  my_theme_general

# Calculate residuals
d_house_lm$residuals <- d_house_lm$SalePrice - d_house_lm$pred_lm

# Plot predictions (on x-axis) versus the residuals
ggplot(d_house_lm, aes(x = pred_lm, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("Residuals vs. Linear Model prediction") +
  my_theme_general

Now, we will also plot the gain curve of the lm_final’s predictions against actual SalePrice. For situations where order is more important than exact values, the gain curve helps us check if the model’s predictions sort in the same order as the true outcome. When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

# Plot the Gain Curve
WVPlots::GainCurvePlot(d_house_lm, "pred_lm", "SalePrice", "lm_final")

A relative gini coefficient of 0.96, \(\approx\) 1, shows that the model correctly sorts high house price situations from lower ones.

7.2.3 Assessing model accuracy

7.2.3.1 In-sample performance

We start with “in-sample” performance, which means the accuracy measured on models fitted to all data available.

# R squared of the final model
glance_lm_final <- broom::glance(lm_final)
glance_lm_final
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.898         0.897 0.131     1064.       0    24  1814. -3576. -3420.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# RMSE of the final model
accuracy(exp(predict(lm_final)), d_house$SalePrice)
##                ME     RMSE      MAE        MPE     MAPE
## Test set 1500.229 24732.18 16792.42 -0.9031386 9.576959

Observations:

  • R-squared = 0.8980557
  • Adjusted R-squared = 0.897212

The model explains 89% of the variance. Does this model is also better at making predictions on new observations that have not been used to train the model? Thus, we move to out-sample performance.

7.2.3.2 Out-sample performance

We now fit a part of the data to train the model and then test its predictive performance on a part of data that was splitted aside.

# Fit the model on the train data
lm_final_train <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Overall.Cond + 
                  Year.Built + Total.Bsmt.SF + Fireplaces + Garage.Cars + 
                    Bldg.Type + Functional, 
                data = d_train)
# Predict the final model on test data
pred_lm_final_test <- predict(lm_final_train, newdata = d_test)

# R^2
r_squared_lm_final <- cor(exp(pred_lm_final_test), d_test$SalePrice)^2

# RMSE
accuracy_lm_final <- accuracy(exp(pred_lm_final_test), d_test$SalePrice)

# The standard deviation of the outcome
sd(d_house$SalePrice)
## [1] 79929.9

Observations:

  • R-squared out-sample = 0.9012988: even higher than R squared in-sample
  • RMSE = 2.5450574^{4}: the standard deviation of the prediction errors is much smaller than the standard deviation of the house price (i.e., 7.9929896^{4} USD)
  • MAPE = 10.4259788: on average, the model predicts with approximately 10% errors.

The cross-validation R-squared (measured on out-of-sample) is slightly higher than in-sample R-squared. The chosen linear model is therefore concluded as a good model because it does not overfit the data.

7.2.3.3 Cross-validation

Previously, we fix the train and test data, does the result change if the data was split differently? To ensure about the model, we will make 3-way cross-validation as follows:

# Remove some observations with very few cases in 2 variables MS.Zoning and Functional 
d_house_lm <- d_house_lm %>%  
  filter(!MS.Zoning %in% c("A (agr)", "I (all)"), 
         !Functional %in% c("Sal", "Sev"))

# Create a cross-validation plan with 3 folds (partitions)
splitPlan <- kWayCrossValidation(nrow(d_house_lm), 3, NULL, NULL)
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
d_house_lm$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + Overall.Cond + 
                  Year.Built + Total.Bsmt.SF + Fireplaces + Garage.Cars +
                MS.Zoning + Bldg.Type + Functional, 
              data = d_house_lm[split$train,])
  d_house_lm$pred.cv[split$app] <- predict(model, 
                                           newdata = d_house_lm[split$app, ])
}
# Predict from a full model
d_house_lm$pred <- predict(lm(log(SalePrice) ~ Gr.Liv.Area + Overall.Qual + 
                                Overall.Cond + Year.Built + Total.Bsmt.SF + 
                                Fireplaces + Garage.Cars + MS.Zoning + 
                                Bldg.Type + Functional, data = d_house_lm))
# Get the rmse of the full model's predictions
ModelMetrics::rmse(exp(d_house_lm$pred), d_house_lm$SalePrice)
## [1] 24763.07
# Get the rmse of the cross-validation predictions
ModelMetrics::rmse(exp(d_house_lm$pred.cv), d_house_lm$SalePrice)
## [1] 24905.87

7.3 Generalised Additive Model

Not all predictors have a linear relationship with the response variable. In fact, there can be non-linear effects such as quadratic or polynomial or higher degree of complexity. Besides, whenever building models, we tend to face a trade-off between flexibility and interpretability. GAM provides us a middle ground between simple models such as linear regression, and complex machine learning models like SVM. GAM is an additive modeling method where the impact of the predictive variables is formulated through smooth functions depending on the underlying patterns in the data can be nonlinear (Goetz et al., 2015).

However, we regularly have to model more advanced phenomena than can be represented by linear relationships. Machine learning models, e.g. NN or SVM, can be very good at making predictions of complex relationships. The challenge with those Machine Learning models is that they tend to need lots of data which are quite difficult to interpret, and one can rarely make conclusions from the model results.

As mentioned above, GAMs offer a middle ground: they can be fit to complex, nonlinear relationships and make predictions, but it is easier to understand the underlying structure of our models and its predictions. To create the models, we have chosen some more focused variables which were defined in the chapter variable selection. First, let’s look at the relationship of predictors to the dependent variables with smooth trend lines:

Different from the linear assumption, some variables have non-linear effect with the price. In this section, we fit two models on the train data and check the errors on the test data. Then, we choose the best fitting model based on our results.

Let’s start with the first model. For consistency, we have chosen the variables which resulted from the variable analysis in the EDA. We also have to define the basis dimension k, which allows for great flexibility in curve fitting.

7.3.1 Model 1

# Fitting model
gam1_train <- gam(log(SalePrice) ~ s(Gr.Liv.Area, k=10)+ s(Overall.Qual, k=10) + 
                    s(Overall.Cond, k=9)+ s(Year.Built, k=10) + 
                    s(Total.Bsmt.SF, k=10)+ s(Fireplaces, k=2)+ 
                    s(Garage.Cars, k=3) + MS.Zoning + Bldg.Type + Functional, 
                  data=d_train, method = "REML")
# Summary of the model
summary(gam1_train)

The deviance explained is 91.1% and the adjusted R-Square is 0.909.

The summary output indicates a strong evidence that Gr.Liv.Area, Overall.Qual, Overall.Cond, Year.Built, Total.Bsmt.SF, Fireplaces and Garage.Cars have effects on the response variable. Additionally, it implies that the categorical variables Bldg.TypeDuplex, Bldg.TypeTwnhs, FunctionalMin1, FunctionalMin2, FunctionalMod, FunctionalSal, FunctionalSev and FunctionalTyp have an effect on the response variable.

The estimated degrees of freedom, edf in the output, quantify the complexity of the smooth function. If we look at the edf, it shows how complex the relationship between each predictor to the SalePrice. The predictor Overall.Qual’s edf is about 6.313. Let’s visualise the effect of this variable:

In this plot, the GAM fitted the model to the data points. This is different from linear regression, which only models linear relationships. On the other hand, the edf of Garage.Cars is almost one, meaning that this term is considered to be a simple linear term. GAMs enables to fit a more flexible model.

We now predict on the test data and check the R-squared and RMSE.

# Predict gam1 on the test data
pred_gam1_test <- predict(gam1_train, newdata = d_test)
# R squared
cor(exp(pred_gam1_test), d_test$SalePrice)^2
## [1] 0.9115518
# Accuracy of the model
gam1_accuracy <- accuracy(as.vector(exp(pred_gam1_test)), d_test$SalePrice)
gam1_accuracy
##                ME     RMSE     MAE       MPE     MAPE
## Test set 499.3159 24083.01 16994.3 -1.264955 9.786083

The out-sample R-squared is 0.9115518 and RMSE is 2.4083008^{4}.

Let’s use the function gam.check() to produce some diagnostic information about the fitting procedure and results:

In the first graph, there is the linear relationship between quantiles and residuals, which slightly deviates. Most of the residuals follow a normal distribution, however on the tails it more fat-tailed than normal distribution. In the second plot on top, we do not see a clear pattern from the residuals, they are scattered around 0.

7.3.2 Model 2

In our second model, we add some interactions within the variables with the function te(). te() terms are the smooth equivalent of main effects + interactions. Additionally, we can also specify a GAM formula that will fit different smooths for different categorical variables, as known as factor-smooth interaction. By specifying the “by” argument to the s() function, R will calculate a different smooth for each unique category. During the realization of this model, several interactions were tried between variables and the following model was the most successful:

# Fitting model gam 2
gam2_train <- gam(log(SalePrice) ~ te(Gr.Liv.Area, Overall.Qual, k=10) + 
                    s(Year.Built, k=20) +  s(Overall.Cond, k=9, by=MS.Zoning) + 
                    s(Total.Bsmt.SF, k=20, by=Bldg.Type) + s(Fireplaces, k=2) + 
                    s(Garage.Cars, k=4) + MS.Zoning + Bldg.Type + Functional, 
                  data=d_train, method = "REML")
# Summary of the model
summary(gam2_train)

The deviance explained is 91.9% and the adjusted R-Square is 0.916. This number is slightly higher than our first model. We now predict our data on the test data:

# Predict gam 2 on the test data
pred_gam2_test <- predict(gam2_train, newdata = d_test)
# R squared
cor(exp(pred_gam2_test), d_test$SalePrice)^2
## [1] 0.9067852
# Accuracy of the model
gam2_accuracy <- accuracy(as.vector(exp(pred_gam2_test)), d_test$SalePrice)
gam2_accuracy
##                ME     RMSE      MAE       MPE     MAPE
## Test set 452.2323 24739.94 17211.06 -1.307563 9.945006

Here is the brief observations regarding to the two models accuracy:

  • Adjusted R-squared of the model: model 1 < model 2
  • R-squared of the out-sample: model 1 (0.911) > model 2 (0.907)
  • RMSE of the out-sample: model 1 (2.4083008^{4}) < model 2 (2.4739943^{4})

Let’s now check our model with the gam.check() function:

Similarly, we do not find any obvious patterns in the residual plots for the model 2.

Generally, we can assume that the model is good, however, some outliers cannot be explained with it and the residuals of both models have fat tails. In conclusion, the first model is slightly better and therefore selected, because its lower RMSE and higher R-squared of the out-sample.

8 Generalised Linear Model with family set to Poisson

8.1 Predicting number of cars

The house price, our main response variable is a continuous variable. In this part, we will predict another variable which is integer: Garage.Cars: The number of cars that a garage can have.

We fit the model with these predictors: Garage.Area, SalePrice and Land.Contour (Flatness of the property with 4 levels: 1) Lvl: Near Flat/Level, 2) Bnk: Banked - Quick and significant rise from street grade to building, 3) HLS: Hillside - Significant slope from side to side, 4) Low: Depression)

However, this is just an example to fit a poisson model based on our dataset. We are fully aware that the garage cars are not really count in the sense of Poisson distribution assumption. More particularly, the variances of the observations does not linearly increases with the mean value. For instance, the box plot below shows no big difference between groups of Land.Contour. Similarly, we tried to make box plot of Garage.Cars with many different categorical variables to investigate. The dataset only consists the car number from 0 to 5. A house commonly does not have 20 cars. Other potential count variables such as Fireplaces and Full.Bath have the same issue. Hence, we just fit a Poisson model to illustrate:

# Boxplot of Garage.Cars with Land.Contour
ggplot(data = d_house, mapping = aes(y = Garage.Cars, x = Land.Contour)) +
  geom_hline(yintercept = 0) + ## to highlight the lower bound of the data
  geom_boxplot() 

# Fit the poisson model to predict number of garage cars
glm1_poisson <- glm(Garage.Cars ~ Garage.Area + SalePrice + Land.Contour, 
                   family = "poisson", ## we specify the distribution!
                   data = d_house)

## summary Poisson model
summary(glm1_poisson)
## 
## Call:
## glm(formula = Garage.Cars ~ Garage.Area + SalePrice + Land.Contour, 
##     family = "poisson", data = d_house)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.86236  -0.19296   0.02656   0.24142   1.55541  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.501e-01  8.437e-02  -4.150 3.33e-05 ***
## Garage.Area      1.598e-03  7.844e-05  20.369  < 2e-16 ***
## SalePrice        3.719e-07  1.947e-07   1.910   0.0561 .  
## Land.ContourHLS  4.872e-02  1.040e-01   0.468   0.6395    
## Land.ContourLow -3.894e-02  1.262e-01  -0.309   0.7577    
## Land.ContourLvl  3.077e-02  8.195e-02   0.375   0.7073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1210.44  on 2924  degrees of freedom
## Residual deviance:  470.83  on 2919  degrees of freedom
## AIC: 7392.7
## 
## Number of Fisher Scoring iterations: 4
## Coefficient: Land Coutour 
exp(coef(glm1_poisson)["Land.ContourHLS"]) %>% round(digits = 2)
## Land.ContourHLS 
##            1.05
## Coefficient: Garage.Area
exp(coef(glm1_poisson)["Garage.Area"])
## Garage.Area 
##    1.001599
## Coefficient: Sale Price
exp(coef(glm1_poisson)["SalePrice"]*10000)
## SalePrice 
##  1.003726

Interpretation of Coefficient of Land Coutour:

R takes Land.ContourBnk (i.e., Banked - Quick and significant rise from street grade to building) as the reference. Hence, after applying exponential, the hillside (significant slope from side to side) land contour get, on average, 5% more garage cars than the banked land contour.

For a given home, increasing its number of Garage area by one square feet, would results in about 0.15% more cars. For a given home, increasing its price by 10000 USD, would results in about 3% more cars.

8.2 Predicting number of rooms

The house price, our main response variable is a continuous variable. In this part, we will predict another variable which is integer: TotRms.AbvGrd: The number of rooms above ground.

We fit the model with these predictors: SalePrice and Bldg.Type, Year.Built to determine TotRms.AbvGrd.

This just an example to fit a poisson model based on our dataset. We are fully aware that rooms are house datasets not considereded as count data in the sense of Poisson distribution assumption. More particularly, the variances of the observations does not linearly increases with the mean value. For instance, ploting the room numbers with Bldg.Type shows an increase in variability, however this is not linearly increasing. With the quasipoisson we identify that we dispersion parameter below 1 and therefore we have underdispersed data

# Boxplot of rooms with Building types
ggplot(data = d_house, mapping = aes(y = TotRms.AbvGrd, x = Bldg.Type)) +
  geom_hline(yintercept = 0) + ## to highlight the lower bound of the data
  geom_boxplot() 

# Fit the poisson model to predict number of rooms
glm2_poisson <- glm(TotRms.AbvGrd ~ Bldg.Type + Year.Built + SalePrice,
                   family = "poisson", ## we specify the distribution!
                   data = d_house)
## summary Poisson model
summary(glm2_poisson)
## 
## Call:
## glm(formula = TotRms.AbvGrd ~ Bldg.Type + Year.Built + SalePrice, 
##     family = "poisson", data = d_house)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76385  -0.34233  -0.02191   0.28478   1.96107  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.718e+00  5.876e-01   6.327 2.49e-10 ***
## Bldg.Type2fmCon  1.626e-01  4.909e-02   3.312 0.000927 ***
## Bldg.TypeDuplex  2.904e-01  3.505e-02   8.286  < 2e-16 ***
## Bldg.TypeTwnhs  -9.226e-02  4.463e-02  -2.067 0.038704 *  
## Bldg.TypeTwnhsE -1.938e-01  3.061e-02  -6.332 2.42e-10 ***
## Year.Built      -1.098e-03  3.035e-04  -3.616 0.000299 ***
## SalePrice        1.670e-06  9.987e-08  16.717  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1078.23  on 2924  degrees of freedom
## Residual deviance:  657.17  on 2918  degrees of freedom
## AIC: 11490
## 
## Number of Fisher Scoring iterations: 4
## Coefficient: Year Built
exp(coef(glm2_poisson)["Year.Built"])
## Year.Built 
##   0.998903
## Coefficient: Sale Price
exp(coef(glm2_poisson)["SalePrice"]*10000)
## SalePrice 
##  1.016836
## Identify RSME and compare to Standard Deviation
accuracy(exp(predict(glm2_poisson)), d_house$TotRms.AbvGrd)
##                     ME    RMSE       MAE       MPE     MAPE
## Test set -5.517379e-14 1.22828 0.9232892 -3.742478 15.18902
sd(d_house$TotRms.AbvGrd)
## [1] 1.560002

Interpretation of Coefficients:

  • For a given year (between 1872 - 2010), increasing the year by 1 year would result in about 0.11% less rooms.
  • For a given home, increasing its price by 10000 USD, would results in about 1.7% more rooms.

The RMSE of the poisson model compared the the actual data is is 1.228. As this is smaller then the standard deviation of rooms in the house dataset.

glm3_quasipoisson <- glm(TotRms.AbvGrd ~ Bldg.Type + Year.Built + SalePrice,
                   family = "quasipoisson", ## we specify the distribution!
                   data = d_house)

## summary Quasipoisson model
summary(glm3_quasipoisson)
## 
## Call:
## glm(formula = TotRms.AbvGrd ~ Bldg.Type + Year.Built + SalePrice, 
##     family = "quasipoisson", data = d_house)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76385  -0.34233  -0.02191   0.28478   1.96107  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.718e+00  2.787e-01  13.339  < 2e-16 ***
## Bldg.Type2fmCon  1.626e-01  2.329e-02   6.982 3.60e-12 ***
## Bldg.TypeDuplex  2.904e-01  1.663e-02  17.467  < 2e-16 ***
## Bldg.TypeTwnhs  -9.226e-02  2.117e-02  -4.358 1.36e-05 ***
## Bldg.TypeTwnhsE -1.938e-01  1.452e-02 -13.349  < 2e-16 ***
## Year.Built      -1.098e-03  1.440e-04  -7.623 3.32e-14 ***
## SalePrice        1.670e-06  4.737e-08  35.242  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.2250152)
## 
##     Null deviance: 1078.23  on 2924  degrees of freedom
## Residual deviance:  657.17  on 2918  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
## Coefficient: Year Built
exp(coef(glm3_quasipoisson)["Year.Built"])
## Year.Built 
##   0.998903
## Coefficient: Sale Price
exp(coef(glm3_quasipoisson)["SalePrice"]*10000)
## SalePrice 
##  1.016836
## Identify RSME and compare to Standard Deviation
accuracy(exp(predict(glm3_quasipoisson)), d_house$TotRms.AbvGrd)
##                     ME    RMSE       MAE       MPE     MAPE
## Test set -5.517379e-14 1.22828 0.9232892 -3.742478 15.18902
sd(d_house$TotRms.AbvGrd)
## [1] 1.560002

The RMSE of the quasipoisson model compared the the actual data is is 1.228. As this is smaller then the standard deviation of rooms in the house dataset.

As the Poisson Models shows a dispersion parameter of 1. This information in the summary highlights that by fitting the Poisson Model, that the variance increases linearly with the mean. In a real setting, count data do not follow a Poisson distribution. If the dispersion parameter is higher than 1, we talk about overdispersed. This can be seen if the residual deviance is larger then the degrees of freedom. However, the opposite seems to be the case.

The quasipoisson model shows dispersion parameter is between 0.22. Therefore, the variance decrease slower then the mean value. We are therefore speaking about underdispersed data.

9 Generalised Linear Model with family set to Binomial

9.1 Predicting groups of price per square feet

In this part, we first calculate the house price per square feet by taking the SalePrice dividing by LotArea. To apply the binomial of GLM, two groups of price per square feet are created: upper and lower price. Afterward, we compute the ratio of upper price houses in the dataset based on their overall quality. Next, the GLM model is built on the rate of Price.Upper (number of successes) and Price.Lower (number of failures) to fulfill the binomial proportion.

# Create two groups of price according to the overall quality
d_house_glm1 <- d_house %>% 
  mutate(SalePrice.Per.SqrFeet = SalePrice / Lot.Area) %>% 
  group_by(Overall.Qual) %>%
  summarise(
    ## Upper price
    Price.Upper = sum(ifelse(SalePrice.Per.SqrFeet > mean(SalePrice.Per.SqrFeet), 1, 0)),
    ## Lower price
    Price.Lower = sum(ifelse(SalePrice.Per.SqrFeet <= mean(SalePrice.Per.SqrFeet), 1, 0)),
    ## Rate of upper
    Price.Upper.Rate = round(Price.Upper/(Price.Upper + Price.Lower), digits = 2))
glm1_binomial <- glm(cbind(Price.Upper, Price.Lower) ~ Overall.Qual,
                    family = "binomial",
                    data = d_house_glm1)

summary(glm1_binomial)
## 
## Call:
## glm(formula = cbind(Price.Upper, Price.Lower) ~ Overall.Qual, 
##     family = "binomial", data = d_house_glm1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9258  -1.6400   0.6369   1.6738   4.5645  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.54856    0.17503  -3.134  0.00172 **
## Overall.Qual -0.02835    0.02808  -1.009  0.31280   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 62.015  on 9  degrees of freedom
## Residual deviance: 60.995  on 8  degrees of freedom
## AIC: 115.25
## 
## Number of Fisher Scoring iterations: 4

After fitting the model, the overall quality seems not play a role in the price-per-square-feet groups. Indeed, its p-value is not significant. Also, the Overall.Qual slope is negative but we could not draw a conclusion since the variable is not significant in this case. However, we will still visualize the results of this model. The red dots are observed value and the predictions are a sequqnece of 100 observations that ranges from one to ten (black dots).

# Plot Effect-Overall Quality
new_data <- data.frame(Overall.Qual = seq(1, 10, length.out = 100))
new_data$pred_price_upper_ratio <- predict(glm1_binomial, newdata = new_data,
                                 type = "response")

##
ggplot(data = d_house_glm1,
       mapping = aes(y = Price.Upper.Rate,
                     x = Overall.Qual)) + 
  ylim(0,1) +
  geom_hline(yintercept = 0:1, col = "gray") +
  ## predictions for conc 0 --> 10
  geom_point(data = new_data,
               mapping = aes(
      y = pred_price_upper_ratio,
      x = Overall.Qual)) +
  ## actual observations
  geom_point(col = "red", 
             size = 3)

As expected, the model does not predict so well.

9.2 Predicting the central air

After the first model, we now attempt to predict whether a central air is in the house. Recall that the center air variable consist of Yes/No. Here, the Overall.Qual will be used as the predictor. Before fitting the model, let’s check the plot of reposnse and independent variables.

# Create Center Air: 1 for yes and 0 for no
d_house_glm2 <- d_house %>% 
  mutate(Central.Air.new = ifelse(Central.Air == "Y", 1, 0))

# a logistic regression model and add the fit to the this graph
ggplot(data = d_house_glm2, mapping = aes(y = Central.Air.new, x = Overall.Qual)) +
  geom_point() +
  geom_smooth(method = "glm", 
              se = FALSE,
              method.args = list(family = "binomial")) 

There are two only possible outcomes here as yes/no or success/failure (coded as 1 and 0 respectively). The higher overall quality seems to have central air but there are also some low quality observations with central air. Let’s fit the model as belows:

# Fitting model 2 to predict Central.Air
glm2_binomial <- glm(Central.Air.new ~ Overall.Qual,
                data = d_house_glm2,
                family = "binomial")

# Summary of the model
summary(glm2_binomial)
## 
## Call:
## glm(formula = Central.Air.new ~ Overall.Qual, family = "binomial", 
##     data = d_house_glm2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2810   0.1611   0.2695   0.4462   2.0212  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.94395    0.36933  -7.971 1.57e-15 ***
## Overall.Qual  1.04022    0.07433  13.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1438.1  on 2924  degrees of freedom
## Residual deviance: 1164.0  on 2923  degrees of freedom
## AIC: 1168
## 
## Number of Fisher Scoring iterations: 6

The overall quality has a positive effect on the response variable. To interpret the coefficient of overall quality, we apply exponential function. When the overall quality increases by one level, the probability of having central air increases 2.8 times. To evaluate the fitted values with the observed values (binary 0 or 1), it’s important to discretize the fitted value into 0 and 1. At this stage, the cutoff of 0.5 is used.

# Discretise the fitted value into 0 and 1 with the cutoff of 0.5
fitted_glm2_binomial <- ifelse(fitted(glm2_binomial) > 0.5,
                               yes = 1, no = 0)

# Make a data frame of observed and fitted values
d_obs_fit_glm2 <- data.frame(obs = d_house_glm2$Central.Air.new,
                             fitted = fitted_glm2_binomial)

# summarise them into a table
table(obs = d_obs_fit_glm2$obs,
      fit = d_obs_fit_glm2$fitted)
##    fit
## obs    0    1
##   0   14  182
##   1    3 2726

To compare the observed and fitted values after discretizing, we summarize them in the confusion matrix above. 14 observations were correctly labeled to be No and 2726 observations were correctly labeled to be Yes. Nevertheless, 182 were wrongly classified by the model to be Yes but in reality, they are No (false positives). Besides, the model made mistakes of 3 observations as being No while they were not in reality (false negative).

10 Support Vector Machines

SVM are supervised learning models and widely used for classification problems. Therefore, we predict the types of building in this chapter to apply this model. As we have 5 different categories, we combined 2fmCon (Two-family Conversion; originally built as one-family dwelling) and 1Fam (Single-family Detached). The reason is that 2fmCon was build as an one-family dwelling. Additionally, we combine TwnhsE (Townhouse End Unit) with Twnhs (Townhouse Inside Unit) because they are both “town house”.

10.1 Predictors

# Copying d_house as SVM Dataframe
d_svm <- d_house

# Creating Character for Renaming
d_svm$Bldg.Type <- as.character(d_svm$Bldg.Type)

# Regrouping and Renaming building type
d_svm$Bldg.Type[d_svm$Bldg.Type == "1Fam"] <- "Familycom"
d_svm$Bldg.Type[d_svm$Bldg.Type == "2fmCon"] <- "Familycom"
d_svm$Bldg.Type[d_svm$Bldg.Type == "Duplex"] <- "Duplex"
d_svm$Bldg.Type[d_svm$Bldg.Type == "Twnhs"] <- "Townhouse"
d_svm$Bldg.Type[d_svm$Bldg.Type == "TwnhsE"] <- "Townhouse"

# Returning Factor for building type
d_svm$Bldg.Type <- as.factor(d_svm$Bldg.Type)

# Inspecting distribution of building types
table(d_svm$Bldg.Type)
## 
##    Duplex Familycom Townhouse 
##       109      2482       334

To check which variables might be relevant to predict the building type, it’s necessary to make several boxplots to investigate some predictors:

fill <- "#FFFFFF"
line <- "#000000"

boxplot_theme_bldgtype <- function(variable, variable_input) {
  output <- ggplot(d_svm, aes(x = Bldg.Type, y = variable)) +
    geom_boxplot(fill = fill, colour = line) +
    ggtitle(variable_input) +
    theme_minimal() +
    scale_y_continuous(labels = comma) +
    labs(x = paste0("\n", "Building Type"), y = paste0("\n", variable_input)) + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1),
          legend.position = "bottom", legend.direction = "horizontal",
          legend.box = "horizontal",
          legend.key.size = unit(1, "cm"),
          plot.title = element_text(family="Tahoma", hjust = 0.5),
          text = element_text(family = "Tahoma"),
          axis.title = element_text(size = 10),
          legend.text = element_text(size = 9),
          legend.title=element_text(face = "bold"))

  return(output)
} 

# Several boxplot to check the effect of different variables on building type
boxplot_theme_bldgtype(log(d_svm$Lot.Area), "Lot.Area")

boxplot_theme_bldgtype(d_svm$Overall.Qual, "Overall.Qual")

boxplot_theme_bldgtype(d_svm$Year.Built, "Year.Built")

boxplot_theme_bldgtype(d_svm$TotRms.AbvGrd, "TotRms.AbvGrd")

boxplot_theme_bldgtype(d_svm$Garage.Area, "Garage.Area")

We checked several predictors, which might be the most suitable for predicting the Building Type. Due to box plots, we identified Lot.Area, Overall.Qual, Year.Built, TotRms.AbvGrd, Garage.Area, Bldg.Type to be the most suitable. To reduce omitting rows due to NA Values, we created a data frame with only the predefined predictors. We use the same train and testing tables as in the previous exercises.

# Logmarizing Lot Area due to be able to plot Lot Area
d_svm$Lot.Area <- log(d_svm$Lot.Area)

# Train data
d_train_svm <- d_svm %>%
  slice(indices) %>%
  select(Lot.Area, Overall.Qual, Year.Built, TotRms.AbvGrd, Garage.Area, Bldg.Type) %>%
  na.omit()

# Test data dropping the building type
d_test_svm <- d_svm %>%
  slice(-indices) %>% 
  select(Lot.Area, Overall.Qual, Year.Built, TotRms.AbvGrd, Garage.Area) %>% 
  na.omit()

# Test data truth
d_test_svm_truth <- d_svm %>%
  slice(-indices) %>%
  select(Lot.Area, Overall.Qual, Year.Built, TotRms.AbvGrd, Garage.Area, Bldg.Type) %>% 
  na.omit() %>% 
  pull(Bldg.Type)

table(d_svm$Bldg.Type)
## 
##    Duplex Familycom Townhouse 
##       109      2482       334

Familycomplex is clearly the biggest category. This fact can also been seen, if we plot the various predictors in a ggplot. Below we plot Lot.Area in the x-axis to each of the other 4 predictors.

plot_1 <- ggplot(data = d_svm, aes(x = Lot.Area, y = Overall.Qual, color = Bldg.Type)) +
  theme_minimal() +  geom_point(alpha = 0.3)

plot_2 <- ggplot(data = d_svm, aes(x = Lot.Area, y = Year.Built, color = Bldg.Type)) +
  theme_minimal() + geom_point(alpha = 0.3)

plot_3 <- ggplot(data = d_svm, aes(x = Lot.Area, y = TotRms.AbvGrd, color = Bldg.Type)) +
  theme_minimal() + geom_point(alpha = 0.3)

plot_4 <- ggplot(data = d_svm, aes(x = Lot.Area, y = Garage.Area, color = Bldg.Type)) +
  theme_minimal() + geom_point(alpha = 0.3)

egg::ggarrange(plot_1, plot_2, plot_3, plot_4, ncol=2, nrow = 2)

10.2 Tune for cost

With the selected predictors, we can perform a support vector machine with the kernel: Linear, Radial, Polynomial and Sigmoid. However, after fitting many models, Linear could not predict the Duplex Type, our analysis shows mainly on Radial as the best result has been seen with this kernel.

So as a first step, we identified the best cost for the the kernel Radial, which is 100.

## 
## Call:
## best.tune(method = svm, train.x = Bldg.Type ~ Lot.Area + Overall.Qual + 
##     Year.Built + TotRms.AbvGrd + Garage.Area, data = d_train_svm, 
##     ranges = list(cost = c(0.1, 1, 10, 100, 1000)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  220

Therefore, we trained a model with the all the determined predictors and with the optimal cost for radial.

set.seed(12)
svm_radial_train <- svm(Bldg.Type ~ ., d_train_svm, kernel = "radial", 
                        scale = TRUE, cost = 100)

To inspect, how our model performs on a plot, we noted that due to the high amount of Familicomplex, we were not able to see a clear split on a two dimensional plot.

plot(svm_radial_train, d_train_svm, Year.Built ~ Overall.Qual)

plot(svm_radial_train, d_train_svm, Lot.Area ~ Year.Built)

plot(svm_radial_train, d_train_svm, Lot.Area ~ TotRms.AbvGrd)

plot(svm_radial_train, d_train_svm, Lot.Area ~ Garage.Area)

plot(svm_radial_train, d_train_svm, Lot.Area ~ Overall.Qual)

plot(svm_radial_train, d_train_svm, Year.Built ~ TotRms.AbvGrd)

plot(svm_radial_train, d_train_svm, Year.Built ~ Garage.Area)

plot(svm_radial_train, d_train_svm, Overall.Qual ~ TotRms.AbvGrd)

plot(svm_radial_train, d_train_svm, Overall.Qual ~ Garage.Area)

plot(svm_radial_train, d_train_svm, TotRms.AbvGrd ~ Garage.Area)

10.3 Assessing model accuracy

pred_svm_radial <- predict(svm_radial_train, d_test_svm)
table(pred_svm_radial)
## pred_svm_radial
##    Duplex Familycom Townhouse 
##        28       745       101
conf_matrix_radial <- confusionMatrix(pred_svm_radial, d_test_svm_truth)
conf_matrix_radial$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.462243e-01   8.031541e-01   9.291283e-01   9.602235e-01   8.340961e-01 
## AccuracyPValue  McnemarPValue 
##   4.119175e-24   1.022750e-01

Afterwards we inspect the accuracy of the model, which is 0.95. Which seems very high. Nevertheless, due to the large number of Familycomplex in this dataset, it can be a biased information. Let’s have a look at the confusion matrix:

conf_matrix_radial
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Duplex Familycom Townhouse
##   Duplex        18         8         2
##   Familycom     16       715        14
##   Townhouse      1         6        94
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9462          
##                  95% CI : (0.9291, 0.9602)
##     No Information Rate : 0.8341          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8032          
##                                           
##  Mcnemar's Test P-Value : 0.1023          
## 
## Statistics by Class:
## 
##                      Class: Duplex Class: Familycom Class: Townhouse
## Sensitivity                0.51429           0.9808           0.8545
## Specificity                0.98808           0.7931           0.9908
## Pos Pred Value             0.64286           0.9597           0.9307
## Neg Pred Value             0.97991           0.8915           0.9793
## Prevalence                 0.04005           0.8341           0.1259
## Detection Rate             0.02059           0.8181           0.1076
## Detection Prevalence       0.03204           0.8524           0.1156
## Balanced Accuracy          0.75118           0.8869           0.9227

Although, accuracy is 0.95, Duplex only could be correctly predicted 57% as seen in sensitivity.

10.4 Cross-validation

To validate our model, we performed three different cross validation model with the method svmLinear, svmRadial.

# Model
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
svm_linear <- train(Bldg.Type ~ ., data = d_train_svm, method = "svmLinear", 
                    trControl=trctrl, preProcess = c("center", "scale"), 
                    tuneLength = 5)

trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
svm_radial <- train(Bldg.Type ~ ., data = d_train_svm, method = "svmRadial", 
                    trControl=trctrl, preProcess = c("center", "scale"), 
                    tuneLength = 5)

The confusion of matrix of the SVM with kernel linear is as follows:

##            Reference
## Prediction  Duplex Familycom Townhouse
##   Duplex         0         0         0
##   Familycom     35       723        17
##   Townhouse      0         6        93
##  Accuracy     Kappa 
## 0.9336384 0.7303775

As seen before and already discussed, with the linear model, the differentiation between Duplex and Familycomplex is not recognized. Next, we examine the results of the SVM with kernel radial:

##            Reference
## Prediction  Duplex Familycom Townhouse
##   Duplex        17         3         2
##   Familycom     17       721        14
##   Townhouse      1         5        94
##  Accuracy     Kappa 
## 0.9519451 0.8199688

With the cross validation model, we were able to fit a model with an accuracy of 0.95. Which, is only minimal better than our trained model. However, the sensitivity for the building type Duplex is in this case worse.

11 Neural Network

11.1 Neural Network for regression

11.1.1 Scaling train and test data

The main purpose of this chapter is to predict SalePrice with neural network. For using the neural Network to calculate a continuous output, we created used the same 70% as train data and 30% test data and removed all columns and removed all NA values to ensure proper application of neuralnet.

# Create dataframe with numerical variables
d_house_numeric_nn <- d_house %>% 
  select(where(is.numeric)) %>%
  ## Remove the variables with many NA
  select(-c(Lot.Frontage, Garage.Yr.Blt)) %>% 
  na.omit()

# Train data
d_train_numeric <- d_house_numeric_nn %>%
  slice(indices) %>% 
  na.omit()

# Test data
d_test_numeric <- d_house_numeric_nn %>%
  slice(-indices) %>% 
  na.omit()

d_test_numeric_truth <- d_test_numeric %>%
  pull(SalePrice)

Additionally, the numerical values will be scaled to increase efficiency of the neural network and better application.

# Scale Testing & Train data
d_max <- apply(d_house_numeric_nn, 2, max)
d_min <- apply(d_house_numeric_nn, 2, min)
d_house_numeric_scaled <- as.data.frame(scale(d_house_numeric_nn, center = d_min, 
                                              scale = d_max - d_min))
d_train_numeric_scaled <- as.data.frame(scale(d_train_numeric, center = d_min, 
                                              scale = d_max - d_min))
d_test_numeric_scaled <- as.data.frame(scale(d_test_numeric, center = d_min, 
                                             scale = d_max - d_min))

Here we compare the training, testing and the full dataset to ensure that the distribution is properly distributed.

# train test with scaled
boxplot(d_train$SalePrice, d_test$SalePrice, d_house %>% sample_frac(1) %>% pull(SalePrice),
        main = "Prepare for Training",
        names = c("Training", "Testing", "Full Data"))

We performed the same step also for the scaled data. Here we see that all the Saleprices are between 0 and 1.

# train test with scaled
boxplot(d_train_numeric_scaled$SalePrice, d_test_numeric_scaled$SalePrice, d_house_numeric_scaled %>% sample_frac(1) %>% pull(SalePrice),
        main = "Prepare for Training Scaled",
        names = c("Training", "Testing", "Full Data"))

11.1.2 Fitting models

We fitting several neural network models with the train data. We fit 5 different models with the different of 3 and 5 hidden layers. However, because each neural network model takes a long time to run, we only present three models with the least errors here. For the first model, all numerical variables will be the predictors. Afterwards, the high correlated variables to SalePrice are used as predictors in this chapter.

# Fit NN 1 with all variables
nn1_train <- neuralnet(SalePrice ~ .,data = d_train_numeric_scaled, hidden = 3, 
                       linear.output = TRUE)
plot(nn1_train)

# Fit NN2 with selected variables
nn2_train <- neuralnet(SalePrice ~ Overall.Qual + Gr.Liv.Area + Garage.Cars + 
                         Total.Bsmt.SF + Year.Built + Full.Bath, 
                       data = d_train_numeric_scaled, hidden = 3, 
                       linear.output = TRUE)
plot(nn2_train)

# Fit NN3 with selected variables
nn3_train <- neuralnet(SalePrice ~ Overall.Qual + Gr.Liv.Area + Garage.Cars + 
                         Total.Bsmt.SF + Garage.Area + X1st.Flr.SF + 
                         Year.Built + Full.Bath + Year.Remod.Add, 
                       data = d_train_numeric_scaled, hidden = 5, 
                       linear.output = TRUE)
plot(nn3_train)

11.1.3 Assessing model accuracy

# Prediction NN 1
pred_nn1_test <- compute(nn1_train, d_test_numeric_scaled %>% select(-SalePrice))
pred_nn1 <- pred_nn1_test$net.result * 
  (max(d_house_numeric_nn$SalePrice) - min(d_house_numeric_nn$SalePrice)) + 
  min(d_house_numeric_nn$SalePrice)

# Prediction NN 2
pred_nn2_test <- compute(nn2_train, d_test_numeric_scaled %>% select(-SalePrice))
pred_nn2 <- pred_nn2_test$net.result * 
  (max(d_house_numeric_nn$SalePrice) - min(d_house_numeric_nn$SalePrice)) + 
  min(d_house_numeric_nn$SalePrice)

# Prediction NN 3
pred_nn3_test <- compute(nn3_train, d_test_numeric_scaled %>% select(-SalePrice))
pred_nn3 <- pred_nn3_test$net.result * 
  (max(d_house_numeric_nn$SalePrice) - min(d_house_numeric_nn$SalePrice)) + 
  min(d_house_numeric_nn$SalePrice)

To be able to visualize the models, we created a function to see the precision between predicted and actual. Additionally we add the RMSE and R^2 for further illustration.

# Function for plot
compare_nn_theme <- function(model, title_number) {
  output <- ggplot(d_test_numeric, aes(x = model, y = SalePrice)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1) +
    labs(x = "Predicted Salseprice", y = "Actual Saleprice", 
         title = paste0("Neural Network Model ", title_number)) + 
    theme_minimal() +
    scale_y_continuous(labels = comma) +
    scale_x_continuous(labels = comma) +
    annotate("text",
             label = paste0("RMSE = ", round(RMSE(model, d_test_numeric_truth), 0), 
                            "\n", "R2 = ", round(R2(model, d_test_numeric_truth), 3)),
             x = 100000,
             y = 600000,
             col = "black",
             size = 5,
             hjust = "left")
  
return(output)}

# 3 plots together
egg::ggarrange(compare_nn_theme(pred_nn1, 1),
               compare_nn_theme(pred_nn2, 2),
               compare_nn_theme(pred_nn3, 3),
               ncol = 1,
               nrow = 3)

Based on the plot above, model 1 has the lowest RMSE and highest R squared. Therefore, model 1 is our final model for this section to predict SalePrice. It took very long to run the cross validation for the NN; we do not present this part in the report.

11.2 Neural Network for classification

11.2.1 Scaling train and test data

The scaled data should not include any NA Values. Therefore, we created a numeric data frame, which excluded predictors with many NA Values or such, which can not be scaled. Afterwards, Lot.Area was log transformed before the numeric data frame was scaled. As the goal is to predict categorical values, the same categorical values are used as in the SVM model. Afterwards, we use the same indices, which was in the prior splitting of our data set, before removing all NA values.

d_house_numeric_2 <- d_house %>% 
  select(where(is.numeric)) %>%
  select(-c(Lot.Frontage, Garage.Yr.Blt, Mas.Vnr.Area, 
            Bsmt.Full.Bath, Bsmt.Half.Bath, MS.SubClass))

d_house_numeric_2$Lot.Area <- log(d_house_numeric_2$Lot.Area)

d_max <- apply(d_house_numeric_2, 2, max)
d_min <- apply(d_house_numeric_2, 2, min)
d_nn_scaled <- as.data.frame(scale(d_house_numeric_2, 
                                   center = d_min, scale = d_max - d_min))

d_nn <- d_nn_scaled
d_nn$bldg.Type <- d_house$Bldg.Type

# Creating Character for Renaming
d_nn$bldg.Type <- as.character(d_nn$bldg.Type)

# Grouping the building types into 3 main categories
d_nn$bldg.Type[d_nn$bldg.Type == "1Fam"] <- "Familycom"
d_nn$bldg.Type[d_nn$bldg.Type == "2fmCon"] <- "Familycom"
d_nn$bldg.Type[d_nn$bldg.Type == "Duplex"] <- "Duplex"
d_nn$bldg.Type[d_nn$bldg.Type == "Twnhs"] <- "Townhouse"
d_nn$bldg.Type[d_nn$bldg.Type == "TwnhsE"] <- "Townhouse"

# Returning Factor for building type
d_nn$bldg.Type <- as.factor(d_nn$bldg.Type)

# Train data
d_train_nn <- d_nn %>%
  slice(indices) %>% 
  na.omit() 

# Test data
d_test_nn <- d_nn %>%
  slice(-indices) %>% 
  na.omit() %>%
  select(-bldg.Type)

d_test_nn_truth <- d_nn %>%
  slice(-indices) %>% 
  na.omit() %>%
  pull(bldg.Type)

11.2.2 Fitting models

We fit the Neural Network with the same five numerical values, which are in the SVM model, to identify the difference between those two fitting models. Two hidden layers are set for this model.

start.time <- Sys.time()
set.seed(12)
nn_bldg_train <- neuralnet(bldg.Type ~ Lot.Area + Overall.Qual + Year.Built + 
                             TotRms.AbvGrd + Garage.Area, data = d_train_nn, 
                           hidden = c(4, 4), stepmax = 1e+07, rep = 1)

end.time <- Sys.time()
time.taken <- end.time

Below we see the resulting network including the weights.

#plot(nn_bldg_train)
test_results <- neuralnet::compute(nn_bldg_train, d_test_nn)

pred_bldg_nn <- test_results$net.result

11.2.3 Assessing model accuracy

After resealing we predicted the model and identified an Accuracy of 0.95.

pred_nn_test <- apply(pred_bldg_nn, 1, which.max)
test_pred_nn <- factor(levels(d_test_nn_truth)[pred_nn_test], 
                       levels = levels(d_test_nn_truth))
conf_matrix <- confusionMatrix(d_test_nn_truth, test_pred_nn)
conf_matrix
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Duplex Familycom Townhouse
##   Duplex        16        18         1
##   Familycom     10       714         5
##   Townhouse      0        14        96
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9451          
##                  95% CI : (0.9278, 0.9592)
##     No Information Rate : 0.8535          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7982          
##                                           
##  Mcnemar's Test P-Value : 0.05632         
## 
## Statistics by Class:
## 
##                      Class: Duplex Class: Familycom Class: Townhouse
## Sensitivity                0.61538           0.9571           0.9412
## Specificity                0.97759           0.8828           0.9819
## Pos Pred Value             0.45714           0.9794           0.8727
## Neg Pred Value             0.98808           0.7793           0.9921
## Prevalence                 0.02975           0.8535           0.1167
## Detection Rate             0.01831           0.8169           0.1098
## Detection Prevalence       0.04005           0.8341           0.1259
## Balanced Accuracy          0.79649           0.9200           0.9615

11.2.4 Optimize Network Structure

First, we need to remodel the data due to some limitations of caret. And have a look at the different models. We performed a tuning of the hidden layers for the model.

xor <- model.matrix(~ bldg.Type, na.omit(d_nn))
xor[,1] <- ifelse(xor[,2] %in% 0 & xor[,3] %in% 0, 1, 0)
colnames(xor)[1] <- 'bldg.TypeDuplex'

nn_mod <- cbind(xor, d_nn %>% select(-bldg.Type))
train_mod <- nn_mod %>% slice(indices)
start.time <- Sys.time()
set.seed(12)
formula = bldg.TypeDuplex + bldg.TypeFamilycom + bldg.TypeTownhouse ~ Lot.Area + 
  Overall.Qual + Year.Built + TotRms.AbvGrd + Garage.Area
models <- train(formula, train_mod,
                method="neuralnet",
                ### Parameters for layers
                tuneGrid = expand.grid(.layer1=c(2:4), .layer2=c(0:4), .layer3=c(0)),
                ### Parameters for optmisation
                learningrate = 0.01,
                threshold = 0.001,
                stepmax = 1e+07)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Time difference of 8.934165 mins
plot(models)

12 Optimization version 1

After analyzing the data set with the sale price and the building type, let’s move to the optimization chapter. As an investment firm, we plan to buy then renovate several dwellings and sell them afterward. First, it is important to identify whether buying and renovating different buildings can sell them to the market with a certain margin. On average, the profit margin for a Duplex house is 100,000 USD, a Townhouse is 50000 USD, and a Family complex is 15,000 USD (based on our market insights and experience).

The profit from this project cannot be reinvested. Therefore, the profit goes directly to our investors. We assume that all have the same price and that the capital is sufficient for buying 120 houses. The internal guidelines demand to buy each building as a whole. Therefore, buying part of a building is not an option. A portfolio constraint is that Townhouses cannot exceed 10% of all buildings. Duplex is the most profitable building type. However, it is only possible to buy them from the largest real estate agency. Therefore, they require us for each Duplex to buy at least 4 Family complex houses.

With our optimization, the main question now is how many building types we can buy each to maximize the profit.

First and foremost, let’s define the number of each building type as follows:

  • D = number of Duplex
  • T = number of Townhouse
  • F = number of Family complex

objective \[ maximize profit = 100000 \times D + 50000 \times T+ 15000 \times F \]

constraints \[ D \ge 0\\ T \ge 0\\ F \ge 0\\ D + T + F \le 120\\ F \ge 4D\\ D + F \ge 10T \]

# setting the coefficients of the objective function
objective <- c(100000,#Duplex Complex
               50000, #Townhouse Complex
               15000) #Margin Family Complex

# setting the constraints 
A <- matrix(c(1, 0, 0,   #Duplex cannot be negative
              0, 1, 0,   #Townhouse cannot be negative
              0, 0, 1,   #Familycomplex cannot be negative
              1, 1, 1,   #Sum of all buildings cannot be bigger then 120
              -4, 0, 1,  #For each Duplex, we need to buy 4 Family complex
              1, -9, 1), #Townhouse cannot be more then 10%
              nrow=6, byrow=TRUE)          
B <- c(0,0,0,120,0,0)                               # constraint rhs        
dir  <- c(">=", ">=", ">=", "<=", ">=", ">=")   # direction of inequality

# we make sure the number of paintings are integers!
optimum <-  lp(direction="max",  objective, A, dir,  B, all.int = TRUE)
## To maximize our margin, we need to buy and renovate 22 Duplex buildings, 10 Townhouses and 88 Family complexes. With this combination our margins is amounting, 4020000 USD.

Our margin for our investors is 4’020’000 USD. We noted that we only buy 10 Townhouses. The maximal allowed amount would be 12. However as the Duplex is more profitable we purchased 22 of them, which is for times more then the Family complex. Therefore all our constraints were fulfilled.

13 Optimization version 2

In the last part of our work, we create three fictive optimization problems and solve it using R.

13.1 Problem 1

A street painting artist is being commissioned to create wall paintings in a very large street art exhibition project. He can create as many painting as he would like to. His time, however, is limited to a total of 500 hours. The curator of the project has specified two options for the painting material qualities: \(S\) = painting of material quality “superior”, and \(P\) = painting of material quality ”premium”. These generate the artist a profit of 240 and 160 CHF per painting, respectively. The artist has 1350 paint sprays available for his entire project. The superior painting quality \(S\) requires 20 sprays while premium painting \(P\) requires 10 sprays. The painting time is 6 and 5 hours, for creating one painting of quality \(S\) or \(P\), respectively.

The artist wonders about the optimal number of painting of each quality type such that his profit is maximized.

  • \(S\) = number of painting with “superior” material quality
  • \(P\) = number of painting with “premium” material quality

The optimization problem can be formulated as follows:

objective

\[ maximize [profit = (240S) + (160P)] \]

constraints

\[ 20S + 10P <= 1350\] \[ 6S + 5P <= 500 \]

The optimization in R is as follows:

# setting the coefficients of the objective function
objective <- c(240, 160)
# setting the constraints 
A <- matrix(c(20, 10, 6, 5, 1, 0, 0, 1),
            nrow=4, byrow=TRUE)             # constraint coefficients
b <- c(1660, 500, 0, 0)                       # constraint rhs      
dir  <- c("<=", "<=", ">=", ">=")             # direction of inequality
# we make sure the number of paintings are integers!
optimum <-  lp(direction="max",  objective, A, dir,  b, all.int = TRUE)
cat("The maximum profit is", optimum$objval, "\n")
## The maximum profit is 19920
cat("The optimal number of paintings are (S,P) = (",optimum$solution,")")
## The optimal number of paintings are (S,P) = ( 83 0 )

The artist should make 44 paintings with material of superior quality and 47 paintings with paint of premium quality. He will be able to generate 18080 CHF for his work, which he can accomplish in \(44 \times 6 + 47 \times 5 = 499 \ hours\).

13.2 Problem 2

Now we assume, that the artist is required to create at least 50 painting with the superior material quality, and he has access to 1550 spray. Let us see how these change the situation:

# setting the coefficients of the objective function
objective <- c(240, 160)
# setting the constraints 
A <- matrix(c(20, 10, 6, 5, 1, 0, 0, 1),
            nrow=4, byrow=TRUE)             # constraint coefficients
b <- c(1550, 500, 50, 0)                        # constraint rhs        
dir  <- c("<=", "<=", ">=", ">=")             # direction of inequality
# we make sure the number of paintings are integers!
optimum <-  lp(direction="max",  objective, A, dir,  b, all.int = TRUE)
cat("The maximum profit is", optimum$objval, "\n")
## The maximum profit is 19280
cat("The optimal number of paintings are (S,P) = (",optimum$solution,")")
## The optimal number of paintings are (S,P) = ( 69 17 )

In the second scenario, the artist should make 69 paintings with material of superior quality and only 17 paintings with paint of premium quality. He will be able to generate more profit, in this case 19280 CHF for his work, for which he needs again \(69 \times 6 + 17 \times 5 = 499 \ hours\).

13.3 Problem 3

Had the artist have 1660 sprays, it would be optimal if he would have painted all his work with the superior quality. This can be seen as follows:

# setting the coefficients of the objective function
objective <- c(240, 160)
# setting the constraints 
A <- matrix(c(20, 10, 6, 5, 1, 0, 0, 1),
            nrow=4, byrow=TRUE)             # constraint coefficients
b <- c(1660, 500, 50, 0)                      # constraint rhs      
dir  <- c("<=", "<=", ">=", ">=")             # direction of inequality
# we make sure the number of paintings are integers!
optimum <-  lp(direction="max",  objective, A, dir,  b, all.int = TRUE)
cat("The maximum profit is", optimum$objval, "\n")
## The maximum profit is 19920
cat("The optimal number of paintings are (S,P) = (",optimum$solution,")")
## The optimal number of paintings are (S,P) = ( 83 0 )

In the third scenario, the artist should make only 83 paintings with material of superior quality. He will be able then to generate even more money, i.e., 19920 CHF, for which he need one hour less work \(83 \times 6 = 498 \ hours\).

14 Conclusion

The project’s purpose is to predict various house factors in Ames, Iowa, United States, by applying six different machine learning techniques and evaluating these models. Our dataset presents 2930 observations with 82 features describing diverse aspects of a house. We proactively conducted the variable comprehension, exploratory data analysis, visualized numerous plots, and built several models in each type of machine learning technique to accomplish the task. Afterward, evaluating all the models, analyzing the residuals, assessing model accuracy with in and out-sample performance, or testing cross-validation are essential steps to select our final models under machine learning methods.

“No pain, no gain” - many challenges appeared during the process. First, testing numerous models took time and computing power. While some are relatively fast, others require much more time to predict. Secondly, another confrontation was finding a suitable countable value for the GLM with the Poisson family. The two potential options were the number of cars and rooms. While the number of cars only goes up to five, the best alternative was the number of rooms. Thirdly, the dataset was already quite complex with many variables; the initial analysis and determination of relevant features were excessively time-consuming. With multiple analyses from different perspectives, we could still focus on the variables having an effect on the response variable to improve the models. The table below summarizes our results. Each model has its advantages and disadvantages. In general, our models have an R-squared or accuracy of at least 0.9. Regarding the first three models to predict the house price, the three R-squared are up to 0.92. Besides, all of the RMSE below the actual price’s standard deviation, which is a good sign. NN is the winner to predict the house price, followed by SVM and LM. Nevertheless, the gap among these three models’ R-squared and RMSE is relatively minor.

With respect to classification with building types, the three last models generate an accuracy of around 0.95. Note that the accuracy in the table above presents the accuracy in the confusion matrix. One highlight is that we applied both Support Vector Machine and Neural Network in the classification problem with the same predictors. The goal was to identify a better model. The accuracy of both models is about 0.95. However, the Family complex is overrepresented than the other building types. In the case of unbalanced data, Kappa might be a better indicator. However, both models performed with a Kappa about 0.80 similar. Although both models show similar results, the Support Vector Machine with the radial kernel would be considered more favorable because of much faster computing.

15 Reflections and limitations

Noticeably, the result might change by optimizing different hyperparameters with different expert knowledge levels. Each machine learning method also serves various purposes. Therefore, applying models with a thoughtful, analytical, critical mindset is crucial. Understanding the fundamentals of the machine methods when fitting model is essential. The class materials are the most trustworthy and helpful source. Grounded by these materials and further research, we conducted every model with contemplation to ensure the proper approach. Also, measuring the accuracy in a comparable unit is very noteworthy.

The study also comes with some limitations. First, due to specific requirements, only limited numbers of machine learning techniques could be applied. Secondly, the page restriction only allows us to display only part of our in-depth analysis. For some of the applied methods, the data was suitable to a lesser degree than for others. Varied directions for upcoming research could be developed. For example, future studies could gather different housing datasets to expand and cross-check each machine learning model. Besides, many other exciting machine learning models such as random forest could be performed. Furthermore, a more comprehensive analysis could be done given a longer working period. With further knowledge in machine learning and a deeper understanding of the relevant domain, we would be able even to perform better predictions. Nevertheless, we have already built our first fundamental knowledge with this work, which will help us in the future.